for ordering the different contributions. As the word "advances" suggests, each
last word on any problem is said, no subject is closed.

Even though there are some overlaps in subject matter, it does not seem 
sensible to order this eclectic collection except by chance, and such an order 

Abstract. We present a unified view of models for text databases, proving new
relations between empirical and theoretical models. A particular case that we
cover is the
Web. We also introduce a simple model for random queries and the size of their 
answers, giving experimental results that support them. As an example of the 
importance of text modeling, we analyze time and space overhead of inverted 
files for the Web. 



1.1 Introduction 

Text databases are becoming larger and larger, the best example being the
World Wide Web (or just Web). For this reason, the importance of information
retrieval (IR) and related topics, such as text mining, is increasing every
day [Baeza-Yates & Ribeiro-Neto, 1999]. However, doing experiments on large
text collections is not easy, unless the Web is used. In fact, although reference
collections such as TREC [Harman, 1995] are very useful, their sizes are several
orders of magnitude smaller than large databases. Therefore, scaling is an
important issue. One partial solution to this problem is to have good models
of text databases, so that new indices and searching algorithms can be analyzed
before making the effort of trying them on a large scale, in particular if our
application is searching the Web. The goals of this article are twofold: (1) to
present in an integrated manner many different results on how to model natural
language text and document collections, and (2) to show their relations,
consequences, advantages, and drawbacks.

We can distinguish three types of models: (1) models for static databases,
(2) models for dynamic databases, and (3) models for queries and their
answers. Models for static databases are the classical ones for natural language
text. They are based on empirical evidence and include the number of different
words or vocabulary (Heaps' law), word distribution (Zipf's law), word
length, distribution of document sizes, and distribution of words in documents.
We formally relate the Heaps' and Zipf's empirical laws and show that they
can be explained from a simple finite state model.

Dynamic databases can be handled by extensions of static models, but there 
are several issues that have to be considered. The models for queries and their 
answers have not been formally developed until now. Which are the correct 
assumptions? What is a random query? How many occurrences of a query are 
found? We propose specific models to answer these questions. 

As an example of the use of the models that we review and propose, we 
give a detailed analysis of inverted files for the Web (the index used in most 
Web search engines currently available), including their space overhead and 
retrieval time for exact and approximate word queries. In particular, we com- 
pare the trade-off between document addressing (that is, the index references 
Web pages) and block addressing (that is, the index references fixed size log- 
ical blocks), showing that having documents of different sizes reduces space 
requirements in the index but increases search times if the blocks/documents 
have to be traversed. As it is very difficult to do experiments on the Web as a 
whole, any insight from analytical models has an important value on its own. 

For the experiments done to back up our hypotheses, we use the collections
contained in TREC-2 [Harman, 1995], especially the Wall Street Journal (WSJ)
collection, which contains 278 files of almost 1 Mb each, with a total of 250
Mb of text. To mimic common IR scenarios, all the texts were transformed to
lower-case, all separators were converted to single spaces (except line breaks),
and stopwords were eliminated (words that are not usually part of a query, like
prepositions, adverbs, etc.). We are left with almost 200 Mb of filtered text. Throughout the
article we talk in terms of the size of the filtered text, which takes 80% of the 
original text. To measure the behavior of the index as n grows, we index the 
first 20 Mb of the collection, then the first 40 Mb, and so on, up to 200 Mb. 
For the Web results mentioned, we used about 730 thousand pages from the 
Chilean Web comprising 2.3Gb of text with a vocabulary of 1.9 million words. 

This article is organized as follows. In Section 2 we survey the main em- 
pirical models for natural language texts, including experimental results and 
a discussion of their validity. In Section 3 we relate and derive the two main 
empirical laws using a simple finite state model to generate words. In Sections 
4 and 5 we survey models for document collections and introduce new models 
for random user queries and their answers, respectively. In Section 6 we use
all these models to analyze the space overhead and retrieval time of different 
variants of inverted files applied to the Web. The last section contains some 
conclusions and future work directions. 

1.2 Modeling a Document 

In this section we present distributions for different objects in a document. 
They include characters, words (unique and total) and their length. 

1.2.1 Distribution of Characters 

Text is composed of symbols from a finite alphabet. We can divide the sym- 
bols in two disjoint subsets: symbols that separate words and symbols that 
belong to words. It is well known that symbols are not uniformly distributed. 
If we consider just letters (a to z), we observe that vowels are usually more
frequent than most consonants (e.g., in English, the letter 'e' has the highest
frequency). A simple model to generate text is the Binomial model. In it, each
symbol is generated with a certain fixed probability. However, natural language
has a dependency on previous symbols. For example, in English, a letter 'f'
cannot appear after a letter 'c', whereas vowels, or certain consonants, have a
higher probability of occurring after 'c'. Therefore, the probability of a symbol
depends on previous symbols. We can use a finite-context or Markovian model
to reflect this dependency. The model can consider one, two or more letters to
generate the next symbol. If we use k letters, we say that it is a k-order model
(so the Binomial model is considered a 0-order model). We can use these models
taking words as symbols. For example, text generated by a 5-order model
using the distribution of words in the Bible might make sense (that is, it can 
be grammatically correct), but will be different from the original [Bell, Cleary 
& Witten, 1990, chapter 4]. More complex models include finite-state models 
(which define regular languages), and grammar models (which define context 
free and other languages). However, finding the correct complete grammar for 
natural languages is still an open problem. 
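
As an illustration of the finite-context idea, the following is a minimal
sketch of a k-order character model. The function names (build_model,
generate) and the toy training string are ours and are not part of the
original work; the sketch only shows how the next symbol is drawn according
to the k preceding symbols.

```python
import random
from collections import defaultdict

def build_model(text, k):
    """Map each k-symbol context to the list of symbols that follow it."""
    model = defaultdict(list)
    for i in range(len(text) - k):
        model[text[i:i + k]].append(text[i + k])
    return model

def generate(model, seed, length, rng=None):
    """Extend `seed` (a context of k symbols) up to `length` symbols."""
    rng = rng or random.Random(0)
    k = len(seed)
    out = list(seed)
    while len(out) < length:
        context = "".join(out[len(out) - k:]) if k else ""
        followers = model.get(context)
        if not followers:          # context never seen followed by anything
            break
        # Choosing from the list of observed followers reproduces their
        # empirical conditional frequencies.
        out.append(rng.choice(followers))
    return "".join(out)

sample = "the cat sat on the mat and the cat ate the rat "
model = build_model(sample, 2)
generated = generate(model, "th", 40)
```

With k = 0 every symbol shares the empty context, which corresponds to the
0-order (Binomial) model described above.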

For most cases, it is better to use a Binomial distribution because it is simpler
(Markovian models are very difficult to analyze) and is close enough to reality.
For example, the distribution of characters in English behaves, on average,
like a uniform distribution over about 15 symbols (that is, the probability of
two letters being equal is about 1/15 for filtered lowercase text, as shown in
Table 1).
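
The "equivalent alphabet size" can be sketched as follows, under the
assumption (ours, read off the sentence above) that it is the inverse of the
probability that two independently drawn letters are equal. The function name
is hypothetical.

```python
from collections import Counter

def equivalent_alphabet_size(text):
    """Inverse of the probability that two random letters of `text` match."""
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    counts = Counter(letters)
    total = len(letters)
    # Probability that two independent draws give the same letter.
    match_prob = sum((c / total) ** 2 for c in counts.values())
    return 1.0 / match_prob
```

For a perfectly uniform text over s letters the function returns s; skewed
letter frequencies give a smaller value.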

1.2.2 Vocabulary Size 

What is the number of distinct words in a document? This set of words is
referred to as the document vocabulary. To predict the growth of the vocabulary
size in natural language text, we use the so-called Heaps' law [Heaps, 1978],
which is based on empirical results. This is a very precise law which states that
the vocabulary of a text of n words is of size $V = K n^{\beta} = O(n^{\beta})$, where $K$
and $\beta$ depend on the particular text. The value of $K$ is normally between 10
and 100, and $\beta$ is a positive value less than one. Some experiments [Araujo et
al, 1997; Baeza-Yates & Navarro, 1999] on the TREC-2 collection show that
the most common values for $\beta$ are between 0.4 and 0.6 (see Table 1). Hence,
the vocabulary of a text grows sub-linearly with the text size, in a proportion
close to its square root. We can also express this law in terms of the number of
words, which would change $K$.
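
One way to estimate K and $\beta$ from a text is a least-squares fit in log-log
space, since $\log V = \log K + \beta \log n$. The sketch below is only an
illustration under that assumption; the function names are ours and the
fitting procedure actually used in the cited experiments may differ.

```python
import math

def vocabulary_growth(words):
    """Return (n, V) pairs: vocabulary size V after reading n words."""
    seen = set()
    points = []
    for n, word in enumerate(words, start=1):
        seen.add(word)
        points.append((n, len(seen)))
    return points

def fit_heaps(points):
    """Least-squares fit of log V = log K + beta * log n."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta
```

In practice one would sample the growth curve at a few text sizes (as the
experiments reported later do, every 20 Mb) rather than at every word.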

Notice that the set of different words of a language is bounded by a constant
(for example, the number of different English words is finite). However, the
limit is so high that it is much more accurate to assume that the size of the
vocabulary is $O(n^{\beta})$ instead of $O(1)$, although the number should stabilize for
huge enough texts. On the other hand, many authors argue that the number
keeps growing anyway because of typing or spelling errors.

How valid is Heaps' law for small documents? Figure 1 shows the evolution
of the $\beta$ value as the text collection grows. We show its value for up to
1 Mb (counting words). As can be seen, $\beta$ starts at a higher value and
converges to the definitive value as the text grows. For 1 Mb it has almost reached
its definitive value. Hence, Heaps' law holds for smaller documents but the
$\beta$ value is higher than its asymptotic limit.



Figure 1. Value of $\beta$ (vertical axis) as the text grows (horizontal axis: n, in Mb).
We added at the end the value for the 200 Mb collection.



For our Web data, the value of $\beta$ is around 0.63. This is larger than for
English text for several reasons. Some of them are spelling mistakes, multiple
languages, etc.

1.2.3 Distribution of Words

How are the different words distributed inside each document? An approximate
model is Zipf's law [Zipf, 1949; Gonnet & Baeza-Yates, 1991],
which attempts to capture the distribution of the frequencies (that is, number
of occurrences) of the words in the text. The rule states that the frequency
of the i-th most frequent word is $1/i^{\theta}$ times that of the most frequent word.
This implies that in a text of n words with a vocabulary of V words, the i-th
most frequent word appears $n/(i^{\theta} H_V(\theta))$ times, where $H_V(\theta)$ is the harmonic
number of order $\theta$ of V, defined as

\[ H_V(\theta) = \sum_{j=1}^{V} \frac{1}{j^{\theta}} \]

so that the sum of all frequencies is n. The value of $\theta$ depends on the text.
In the simplest formulation, $\theta = 1$, and therefore $H_V(\theta) = O(\log n)$.
However, this simplified version is very inexact, and the case $\theta > 1$ (more
precisely, between 1.7 and 2.0, see Table 1) fits the real data better [Araujo
et al, 1997]. This case is very different, since the distribution is much more
skewed, and $H_V(\theta) = O(1)$. Experimental data suggests that a better model is
$k/(c + i)^{\theta}$, where c is an additional parameter and k is such that all frequencies
add to n. This is called a Mandelbrot distribution [Miller, Newman & Friedman,
1957; Miller, Newman & Friedman, 1958]. This distribution is not used
because its asymptotic effect is negligible and it is much harder to deal with
mathematically.
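
The frequencies predicted by Zipf's law can be computed directly from the
definition of the harmonic number of order $\theta$. The sketch below uses
hypothetical function names and illustrative parameter values; it is not
taken from the original experiments.

```python
def harmonic(V, theta):
    """Harmonic number of order theta of V: sum of 1 / j**theta, j = 1..V."""
    return sum(1.0 / j ** theta for j in range(1, V + 1))

def zipf_frequencies(n, V, theta):
    """Expected occurrences of the i-th most frequent word, i = 1..V."""
    h = harmonic(V, theta)
    return [n / (i ** theta * h) for i in range(1, V + 1)]

# Illustrative values only: a text of 1000 words over 50 distinct words.
freqs = zipf_frequencies(1000, 50, 1.8)
```

By construction the frequencies add up to n, and the ratio between the first
and the i-th frequency is $i^{\theta}$.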

It is interesting to observe that if, instead of taking text words, we take
n-grams, no Zipf-like distribution is observed. Moreover, no good model is
known for this case [Bell, Cleary & Witten, 1990, chapter 4]. On the other
hand, Li [Li, 1992] shows that a text composed of random characters (separators
included) also exhibits a Zipf-like distribution with smaller $\theta$, and argues
that the Zipf distribution appears because the rank is chosen as an independent
variable. Our results relating Zipf's and Heaps' laws (see next section)
agree with that argument, which in fact had been mentioned well before
[Miller, Newman & Friedman, 1957].

Since the distribution of words is very skewed (that is, there are a few hun- 
dred words which take up 50% of the text), words that are too frequent, such 
as stopwords, can be disregarded. A stopword is a word which does not carry 
meaning in natural language and therefore can be ignored (that is, made not 
searchable), such as "a", "the", "by", etc. Fortunately the most frequent 
words are stopwords, and therefore half of the words appearing in a text do 
not need to be considered. This allows us, for instance, to significantly reduce the
space overhead of indices for natural language texts. Nevertheless, there are
very frequent words that cannot be considered as stopwords. 

For our Web data, $\theta = 1.59$, which is smaller than for English text. This is
what we expect if the vocabulary is larger. Also, to capture well the central part
of the distribution, we did not take into account very frequent and infrequent
words when fitting the model. A related problem is the distribution of k-grams
(strings of exactly k characters), which follow a similar distribution [Egghe,
2000].

1.2.4 Average Length of Words 

A last issue is the average length of words. This relates the text size in 
words with the text size in bytes (without accounting for punctuation and other 
extra symbols). For example, in the different sub-collections of the TREC-2
collection, the average word length is very close to 5 letters, and the range of
variation of this average in each sub-collection is small (from 4.8 to 5.3). If
we remove the stopwords, the average length of a word increases to a little more
than 6 letters (see Table 1). If we take the average length in the vocabulary, the
value is higher (between 7 and 8 as shown in Table 1). This defines the total 
space needed for the vocabulary. Figure 2 shows how the average length of the 
vocabulary words and the text words evolve as the filtered text grows for the 
WSJ collection. 



Figure 2. 
line). 

Heaps' law implies that the length of the words of the vocabulary increases
logarithmically as the text size increases, and longer and longer words should
appear as the text grows. This is because if for large n there are $n^{\beta}$ different
words, then their average length must be at least $\log_{\sigma}(n^{\beta}) = \beta \log_{\sigma} n$
(counting each different word once), where $\sigma$ is the alphabet size. However, the
average length of the words in the
overall text should be constant because shorter words are common enough (e.g.
stopwords). Our experiment of Figure 2 shows that the length is almost constant,
although it decreases slowly. This balance between short and long words,
such that the average word length remains constant, has been noticed many
times in different contexts. It can be explained by a simple finite-state model
where the separators have a fixed probability of occurrence, since this implies
that the average word length is one over that probability. Such a model is
considered in [Miller, Newman & Friedman, 1957; Miller, Newman & Friedman,
1958], where: (a) the space character has probability close to 0.2, (b) the space
character cannot appear twice consecutively, and (c) there are 26 letters.
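
The claim that the average word length is one over the separator probability
can be checked with a small simulation. The sketch below is one possible
reading of the finite-state model (an assumption on our part): a word always
starts with a letter, because two spaces cannot be adjacent, and after each
letter the next symbol is a space with probability p_space. The function names
are ours.

```python
import random

def random_word_length(p_space, rng):
    """Length of one word: letters emitted until a space is drawn."""
    length = 1                      # the first symbol after a space is a letter
    while rng.random() >= p_space:  # with probability 1 - p_space, one more letter
        length += 1
    return length

def average_word_length(p_space, samples=200_000, seed=1):
    """Monte Carlo estimate of the mean word length."""
    rng = random.Random(seed)
    total = sum(random_word_length(p_space, rng) for _ in range(samples))
    return total / samples
```

Under this reading the word length is geometrically distributed with mean
1/p_space, so a space probability of 0.2 gives words of about 5 letters on
average, in line with the unfiltered TREC-2 figures quoted above.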



1.3 Relating the Heaps' and Zipf's Law 

In this section we relate and explain the two main empirical laws: Heaps'
and Zipf's. In particular, if both are valid, then a simple relation between their
parameters holds. This result is from [Baeza-Yates & Navarro, 1999].

Assume that the least frequent word appears $O(1)$ times in the text (this is
more than reasonable in practice, since a large number of words appear only
once). Since there are $\Theta(n^{\beta})$ different words, the least frequent word has
rank $i = \Theta(n^{\beta})$. The number of occurrences of this word is, by Zipf's law,

\[ \frac{n}{i^{\theta} H_V(\theta)} = \Theta\left( \frac{n}{n^{\beta\theta} H_V(\theta)} \right) \]

and this must be $O(1)$. Since $H_V(\theta) = O(1)$ for $\theta > 1$, this implies that, as n
grows, $\beta = 1/\theta$. This equality
may not hold exactly for real collections. This is because the relation is
asymptotic and hence is valid for sufficiently large n, and because Heaps'
and Zipf's rules are approximations. Considering each collection of TREC-2
separately, $\beta\theta$ is between 0.80 and 1.00. Table 1 shows specific values for $K$
and $\beta$ (Heaps' law) and $\theta$ (Zipf's law), without filtering the text. Notice that
$1/\beta$ is always larger than $\theta$. On the other hand, for our Web data, the match is
almost perfect, as $\beta\theta \approx 1$.



Text   K      β      1/β    θ      Len. (text)   Len. (vocab.)   Eq. σ
AP     26.8   0.46   2.17   1.87   6.328         8.012           15.44
DOE    10.8   0.52   1.92   1.70   6.429         8.423           15.41
FR     13.2   0.48   2.08   1.94   6.096         6.827           15.64
WSJ    43.5   0.43   2.33   1.87   6.233         7.453           15.37
ZIFF   11.3   0.51   1.96   1.79   6.441         7.181           15.79

Table 1. Experimental results for the parameters of Heaps' and Zipf's laws, as well as the
average length of words and equivalent alphabet size.
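
The relation between the two laws can be checked against the values of
Table 1 with a few lines of arithmetic. The dictionary below simply copies the
$\beta$ and $\theta$ columns of the table; the variable names are ours.

```python
# (beta, theta) per collection, copied from Table 1.
TABLE_1 = {
    "AP":   (0.46, 1.87),
    "DOE":  (0.52, 1.70),
    "FR":   (0.48, 1.94),
    "WSJ":  (0.43, 1.87),
    "ZIFF": (0.51, 1.79),
}

# The asymptotic relation predicts beta * theta = 1.
products = {name: beta * theta for name, (beta, theta) in TABLE_1.items()}

for name, value in sorted(products.items()):
    print(f"{name:5s} beta*theta = {value:.2f}")
```

Every product falls in the range 0.80 to 1.00 stated in the text, and
$1/\beta$ exceeds $\theta$ in each row.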



The relation between Heaps' and Zipf's laws is mentioned in a line of a paper
by Mandelbrot [Mandelbrot, 1954], but no proof is given. In the Appendix
we give a nontrivial proof based on a simple finite-state model for generating
words.

1.4 Modeling a Document Collection 

Heaps' and Zipf's laws are also valid for whole collections. In particular,
the vocabulary should grow faster (larger $\beta$) and the word distribution
could be more biased (larger $\theta$). That would better match the relation $\beta\theta = 1$,
which in TREC-2 is less than 1. However, there are no experiments on large
collections to measure these parameters (for example, in the Web). In addition,
as the total text size grows, the predictions of these models become more
accurate.

1.4.1 Word Distribution Within Documents 

The next issue is the distribution of words in the documents of a collection.
The simplest assumption is that each word is uniformly distributed in
the text. However, this rule is not always true in practice, since words tend to
appear repeated in small areas of the text (locality of reference). A uniform
distribution in the text is a pessimistic assumption since it implies that queries
appear in more documents. However, a uniform distribution can have different
interpretations. For example, we could say that each word appears the same
number of times in every document. However, this is not fair if the document
sizes are different. In that case, we should have occurrences proportional to
the document size. A better model is to use a Binomial distribution. That is, if
f is the frequency of a word in a set of D documents with n words overall, the
probability of finding the word k times in a document having w words (w < f)
is

\[ \Pr(k,n,w) = \binom{w}{k} p^{k} (1-p)^{w-k}, \qquad p = \frac{f}{n}. \]

For large w, we can use the Poisson approximation

\[ \Pr(k,n,w) = \frac{\lambda^{k}}{k!} e^{-\lambda} \]

with $\lambda = w f/n$. Some people apply these formulas using the average for all
the documents, which is unfair if document sizes are very different.
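
The quality of the Poisson approximation can be checked numerically. The
sketch below assumes the Binomial form with p = f/n and the Poisson rate
w f/n; the function names and the sample values of w, f and n are ours and
purely illustrative.

```python
import math

def binomial_prob(k, w, p):
    """Probability of exactly k occurrences in a document of w words."""
    return math.comb(w, k) * p ** k * (1 - p) ** (w - k)

def poisson_prob(k, lam):
    """Poisson approximation with rate lam = w * f / n."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Illustrative values: a word with f = 500 occurrences in n = 1,000,000
# words, and a document of w = 1000 words.
w, f, n = 1000, 500, 1_000_000
p = f / n
lam = w * f / n
gaps = [abs(binomial_prob(k, w, p) - poisson_prob(k, lam)) for k in range(6)]
```

For these values the two distributions agree to within about one part in a
thousand for small k, which is why the Poisson form is convenient when w is
large and p is small.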

A model that better approximates what is seen in real text collections is
to consider a negative binomial distribution, which says that the fraction of
documents containing a word k times is

\[ F(k) = \binom{\alpha + k - 1}{k} p^{k} (1+p)^{-\alpha-k} \]

where $p$ and $\alpha$ are parameters that depend on the word and the document
collection. Notice that F(k) = D Pr(k,n,w) if we use w = n/D, the average
number of words per document, so this distribution also has the problem of being
unfair if document sizes are different. For example, for the Brown Corpus
[Francis & Kucera, 1982] and the word "said", we have $p = 9.24$ and $\alpha = 0.42$
[Church & Gale, 1995]. The latter reference gives other models derived from a
Poisson distribution. Another model related to Poisson which takes into account
locality of reference is the Clustering Model [Thom & Zobel, 1992].
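
The negative binomial model can be evaluated for non-integer $\alpha$ by writing
the binomial coefficient with the gamma function. The sketch below assumes
the form $\binom{\alpha+k-1}{k} p^{k} (1+p)^{-\alpha-k}$ and uses the values quoted
for the word "said"; the function name is ours.

```python
import math

def negative_binomial(k, p, alpha):
    """Fraction of documents containing the word exactly k times."""
    # log of the generalized binomial coefficient C(alpha + k - 1, k)
    log_coeff = (math.lgamma(alpha + k) - math.lgamma(k + 1)
                 - math.lgamma(alpha))
    return math.exp(log_coeff + k * math.log(p)
                    - (alpha + k) * math.log(1 + p))

# Parameters quoted for the word "said" in the Brown Corpus.
P_SAID, ALPHA_SAID = 9.24, 0.42
fractions = [negative_binomial(k, P_SAID, ALPHA_SAID) for k in range(2000)]
```

The fractions sum to one and their mean is $\alpha p$, so the two parameters
together fix the average number of occurrences per document, while a small
$\alpha$ produces the long tail that a Poisson model lacks.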



1.4.2 Distribution of Document Sizes



Static databases will have a fixed document size distribution. Moreover, depending
on the database format, the distribution can be very simple. However,
this is very different for databases that grow fast and in a chaotic manner, such
as the Web. The results that we present next are based on the Web.

The document sizes are self-similar [Crovella & Bestavros, 1996], that is,
the probability distribution remains unchanged if we change the size scale. The
same behavior appears in Web traffic. This can be modeled by two different
distributions. The main body of the distribution follows a Logarithmic Normal
curve, such that the probability of finding a Web page of x bytes is given by

\[ p(x) = \frac{1}{x \sigma \sqrt{2\pi}} e^{-(\ln x - \mu)^{2} / 2\sigma^{2}} \]

where the average ($\mu$) and standard deviation ($\sigma$) are 9.357 and 1.318, respectively
[Barford & Crovella, 1998]. See an example in Figure 3 (from [Crovella &
Bestavros, 1996]).




Figure 3. Left: Distribution for all file sizes. Right: Right tail distribution for different file
types. All logarithms are in base 10. (Both figures are courtesy of Mark Crovella).
(Horizontal axes: log(File Size), with the file size in bytes.)



The right tail of the distribution is "heavy-tailed". That is, the majority of
documents are small, but there is a nontrivial number of large documents.
This is intuitive for image or video files, but it is also true for textual pages. A
good fit is obtained with the Pareto distribution, which says that the probability
of finding a Web page of x bytes is

\[ p(x) = \frac{\lambda k^{\lambda}}{x^{1+\lambda}} \]

for x > k, and zero otherwise. The cumulative distribution is

\[ F(x) = 1 - \left( \frac{k}{x} \right)^{\lambda} \]

where $k$ and $\lambda$ are constants dependent on the particular collection [Barford
& Crovella, 1998]. The parameter $k$ is the minimum document size, and $\lambda$ is
about 1.36 for textual data, being smaller for images and other binary formats
[Crovella & Bestavros, 1996; Willinger & Paxson, 1998] (see the right side of
Figure 3). Taking all Web documents into account, using k = 9.3Kb, we get
$\lambda = 1.1$, and 93% of all the files have a size below this value. The parameters
of these distributions were obtained from a sample of more than 50 thousand
Web pages requested by several users in a period of two months. Recent results
show that these distributions are still valid [Barford et al, 1999], but the exact
parameters for the distribution of all textual documents are not known, although
the average page size is estimated at 6Kb including markup (which is traditionally
not indexed).
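
The two size distributions are easy to evaluate numerically. The sketch below
is an illustration only: the function names are ours, the default parameters
are the values quoted above, and the choice of 9300 bytes for "9.3Kb" and of
treating x = k as inside the Pareto support are assumptions.

```python
import math

def lognormal_pdf(x, mu=9.357, sigma=1.318):
    """Density of the main body: log-normal in the page size x (bytes)."""
    z = (math.log(x) - mu) ** 2 / (2 * sigma ** 2)
    return math.exp(-z) / (x * sigma * math.sqrt(2 * math.pi))

def pareto_pdf(x, k, lam):
    """Density of the heavy right tail; zero below the minimum size k."""
    return lam * k ** lam / x ** (1 + lam) if x >= k else 0.0

def pareto_cdf(x, k, lam):
    """Probability that a page has at most x bytes under the Pareto model."""
    return 1.0 - (k / x) ** lam if x >= k else 0.0

# Illustrative use: fraction of tail documents larger than 100,000 bytes,
# with k = 9300 bytes (assumed reading of 9.3Kb) and lambda = 1.1.
tail_fraction = 1.0 - pareto_cdf(100_000, 9300, 1.1)
```

Because the Pareto tail decays polynomially rather than exponentially, a
noticeable fraction of documents remains far above the minimum size, which is
the "heavy-tailed" behavior described above.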

1.5 Models for Queries and Answers 

1.5.1 Motivation 

When analyzing or simulating text retrieval algorithms, a recurrent problem 
is how to model the queries. The best solution is to use real users or to extract 
information from query logs. There are a few surveys and analyses of query 
logs with respect to the usage of Web search engines [Pollock & Hockley, 
1997; Jensen et al, 1998; Silverstein et al, 1998]. The latter reference is a
study of 285 million AltaVista user sessions containing 575 million queries.
Table 2 gives some results from that study, done in September of 1998. Another
recent study on Excite shows similar statistics, and also the query topics
[Spink et al, 2002]. Nevertheless, these studies give little information about
the exact distribution of the queries. In the following we give simple models 
to select a random query and the corresponding average number of answers 
that will be retrieved. We consider exact queries and approximate queries. An 
approximate query finds a word allowing up to k errors, where we count the 
minimal number of insertions, deletions, and substitutions. 
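
The error measure just described is the classical edit (Levenshtein)
distance. A minimal dynamic-programming sketch follows; the function names
and the sample vocabulary are ours, and real systems use faster algorithms
than this quadratic one.

```python
def edit_distance(a, b):
    """Minimal number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))      # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # substitute or match
        prev = cur
    return prev[-1]

def approximate_matches(query, vocabulary, k):
    """Vocabulary words that match `query` with at most k errors."""
    return [w for w in vocabulary if edit_distance(query, w) <= k]
```

An approximate query for a word is then answered by first finding all
vocabulary words within k errors and afterwards retrieving the occurrences of
each of them.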

1.5.2 Random Queries 

As half of the text words are stopwords, and they are not typical user queries,
stopwords are not considered. The simplest assumption is that user queries
are distributed uniformly in the vocabulary, i.e. every word in the vocabulary
can be searched with the same probability. This is not true in practice, since
infrequent words are searched with higher probability. On the other hand,
approximate searching makes this distribution more uniform, since infrequent
words may match with k errors with other words, with little relation to the
frequencies of the matched words. In general, however, the assumption of
uniform distribution in the vocabulary is pessimistic, at least because a match
is always found.

Measure                     Average value   Range
Number of words             2.35            to 393
Number of operators         0.41            to 958
Repetitions of each query   3.97            1 to 1.5 million

Table 2. Queries on the Web: average number of words, Boolean operations, and query
repetitions.

Looking at the results in the AltaVista log analysis [Silverstein et al, 1998],
there are some queries much more popular than others and the range is quite
large. Hence, a better model would be to consider that the queries also follow
a Zipf-like distribution, perhaps with $\theta$ larger than 2 (the log data is not
available to fit the best value). However, the actual frequency order of the words
in the queries is completely different from the words in the text (for example,
"sex" and "xxx" are among the most frequent query words), which
makes a formal analysis very difficult. An open problem, which is related to
the models of term distribution in documents, is whether the distribution for
query terms appearing in a collection of documents is similar to that of document
terms. This is very important as these two distributions are the base for
relevance ranking in the vector model [Baeza-Yates & Ribeiro-Neto, 1999].
Recent results show that although queries also follow a Zipf distribution (with
parameter $\theta$ from 1.24 to 1.42 [Baeza-Yates & Castillo, 2001; Baeza-Yates &
Saint-Jean, 2002]), the correlation to the word distribution of the text is low
(0.2) [Baeza-Yates & Saint-Jean, 2002]. This implies that choosing queries at
random from the vocabulary is reasonable and even pessimistic.

Previous work by DeFazio [DeFazio, 1993] divided the query vocabulary in 
three segments: high (words representing the most used 90% of the queries), 
moderate (next 5% of the queries), and low use (words representing the least 
used 5% of the queries). Words are then generated by first randomly choosing
the segment, then randomly picking a token within that segment. Queries are
formed by choosing randomly one to 50 words. According to currently available
data, real queries are much shorter, and the generation algorithm does not
produce the original query distribution. Another problem is that the query vo- 
cabulary must be known to use this model. However, in our model, we can 
generate queries from the text collection. 
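
A minimal sketch of this simple query model is shown below: single-word
queries are drawn uniformly, with repetitions, from the vocabulary of the
filtered text. The function names, the tokenization by whitespace and the
tiny stopword list are our assumptions and only serve the illustration.

```python
import random

def build_vocabulary(text, stopwords):
    """Distinct non-stopword words of the text, in a fixed (sorted) order."""
    return sorted({w for w in text.lower().split() if w not in stopwords})

def random_queries(vocabulary, count, seed=0):
    """Draw `count` single-word queries uniformly, allowing repetitions."""
    rng = random.Random(seed)
    return [rng.choice(vocabulary) for _ in range(count)]

text = "the index stores the vocabulary and the list of blocks"
vocabulary = build_vocabulary(text, stopwords={"the", "and", "of"})
queries = random_queries(vocabulary, 5)
```

Because every query is taken from the vocabulary, a match is always found,
which is one reason the uniform model is considered pessimistic.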

1.5.3 Number of Answers

Now we analyze the expected number of answers that will be obtained using
the simple model of the previous section. For a simple word search, we
will find just one entry in the vocabulary matching it. Using Heaps' law, the
average number of occurrences of each word in the text is $n/V = O(n^{1-\beta})$.
Hence, the average number of occurrences of the query in the text is $O(n^{1-\beta})$.
This fact is surprising, since one can think of the process of traversing the text
word by word, where each word of the vocabulary has a fixed probability of
being the next text word. Under this model the number of matching words
is a fixed proportion of the text size (this is equivalent to saying that a word of
length $\ell$ should appear about $O(n/\sigma^{\ell})$ times). The fact that this is not the case
(demonstrated experimentally later) shows that this model does not really hold
on natural language text.

The root of this fact is not that a given word does not appear with a
fixed probability. Indeed, Heaps' law is compatible with a model where
each word appears at fixed text intervals. For instance, imagine that Zipf's
law stated that the i-th word appeared $n/2^{i}$ times. Then, the first word could
appear in all the odd positions, the second word in all the positions multiple
of 4 plus 2, the third word in all the multiples of 8 plus 4, and so on. The
real reason for the sublinearity is that, as the text grows, there are more words,
and one selects randomly among them. Asymptotically, this means that the
length of the vocabulary words must be $\ell = \Omega(\log n)$, and therefore, as the
text grows, we search on average longer and longer words. This means that
even in the model where there are $n/\sigma^{\ell}$ matches, this number is indeed $o(n)$
[Navarro, 1998]. Note that this means that users search for longer words when
they query larger text collections, which seems awkward but may be true, as
the queries are related to the vocabulary of the collection.

How many words of the vocabulary will match an approximate query? In
principle, there is a constant bound to the number of distinct words which
match a given query with k errors, and therefore we can say that $O(1)$ words
in the vocabulary match the query. However, not all those words will appear
in the vocabulary. Instead, while the vocabulary size increases, the number
of matching words that appear increases too, at a lower rate. This is the same
phenomenon observed in the size of the vocabulary. In theory, the total number
of words is finite and therefore $V = O(1)$, but in practice that limit is never
reached and the model $V = O(n^{\beta})$ describes reality much better. We show
experimentally that a good model for the number of matching words in the
vocabulary is $O(n^{\nu})$ (with $\nu < \beta$). Hence, the average number of occurrences
of the query in the text is $O(n^{1-\beta+\nu})$ [Baeza-Yates & Navarro, 1999].

1.5.4 Experiments

We present in this section empirical evidence supporting our previous statements.
We first measure V, the number of words in the vocabulary in terms of
n (the text size). Figure 4 (left side) shows the growth of the vocabulary. Using
least squares we fit the curve $V = 78.81\, n^{0.40}$. The relative error is very small
(0.84%). Therefore, $\beta = 0.4$ for the WSJ collection.



Figure 4. Vocabulary tests for the WSJ collection. On the left, the number of words in the
vocabulary. On the right, number of matching words in the vocabulary.
(Horizontal axes: n in Mb, up to 200; the curves on the right are labeled by k.)



We now measure the number of words that match a given pattern in the
vocabulary. For each text size, we select words at random from the vocabulary
allowing repetitions. In fact, not all user queries are found in the vocabulary in
practice, which reduces the number of matches. Hence, this test is pessimistic
in that sense.

We test k = 1, 2 and 3 errors. To avoid taking into account queries with 
very low precision (e.g. searching a 3-letter word with 2 errors may match too 
many words), we impose limits on the length of words selected: only words of 
length 4 or more are searched with one error, length 6 or more with two errors, 
and 8 or more with three errors. 

We perform a number of queries which is large enough to ensure a relative
error smaller than 5% with a 95% confidence interval. Figure 4 (right side)
shows the results. We use least squares to fit the curves $0.31\, n^{0.14}$ for k = 1,
$0.61\, n^{0.18}$ for k = 2 and $0.88\, n^{0.19}$ for k = 3. In all cases the relative error
of the approximation is under 4%. The exponents are the $\nu$ values mentioned
later in this article. One possible model for $\nu$ is $\beta(1 - e^{-\alpha k})$, because for
k = 0 we have $\nu = 0$ and when $k \to \infty$, $\nu \to \beta$, as expected.
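
The fitted curves can be combined to estimate how many occurrences an
approximate query retrieves, following the model $O(n^{1-\beta+\nu})$ of the
previous section. The sketch below copies the WSJ constants from the text;
the function names are ours, and since the unit of n used in the fits is not
restated here, absolute values should be read as illustrative only. The
growth rate, which depends only on the exponents, is the point of interest.

```python
# Fitted constants for the WSJ collection, copied from the text.
HEAPS_K, BETA = 78.81, 0.40
MATCH_FIT = {1: (0.31, 0.14), 2: (0.61, 0.18), 3: (0.88, 0.19)}  # k -> (c, nu)

def vocabulary_size(n):
    """Heaps' law fit: V = K * n**beta."""
    return HEAPS_K * n ** BETA

def matching_words(n, k):
    """Fitted number of vocabulary words matching a query with k errors."""
    c, nu = MATCH_FIT[k]
    return c * n ** nu

def expected_occurrences(n, k):
    """Average occurrences per word (n / V) times the matching words."""
    return (n / vocabulary_size(n)) * matching_words(n, k)

# Doubling the text multiplies the answer size by 2**(1 - beta + nu) < 2.
growth_k1 = expected_occurrences(2e8, 1) / expected_occurrences(1e8, 1)
```

For k = 1 the exponent is 1 - 0.40 + 0.14 = 0.74, so doubling the text size
increases the expected number of answers by a factor of about 1.67 rather
than 2.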

We could reduce the variance in the experiments by selecting the set of
queries once, from the index of the first 20 Mb. However, our experiments have
shown that this is not a good policy. The reason is that the first 20 Mb
contain almost all the common words, whose occurrence lists grow faster than
the average, while most uncommon words are not included. Therefore, the result
would be unfair, making the results look linear when they are in fact
sublinear.

1.6 Application: Inverted Files for the Web 
1.6.1 Motivation 

Web search engines currently available use inverted files that reference Web
pages [Baeza-Yates & Ribeiro-Neto, 1999]. So, reference pointers should have
as many bits as needed to reference all Web pages (currently, about 3 billion).
The number and size of the pointers is directly related to the space overhead
of the inverted file. For the whole Web, this implies at least 600 GB. Some
search engines also index word locations, so the space needed is increased.
One way to reduce the size of the index is to use fixed logical blocks as
reference units, trading the space reduction for an extra cost at search time.
The block mechanism is a logical layer and the files do not need to be
physically split or concatenated. In what follows we explain this technique in
more detail.

Assume that the text is logically divided into "blocks". The index stores all
the different words of the text (the vocabulary). For each word, the list of
the blocks where the word appears is kept. We call $b$ the size of the blocks
and $r$ the number of blocks, so that $n \approx rb$. The exact organization
is shown in Figure 5. This idea was first used in Glimpse [Manber & Sun Wu,
1994].
Figure 5. The block-addressing indexing scheme.
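The block-addressing scheme can be sketched in a few lines. This is only an
illustration under simplifying assumptions: the block size is counted in words
rather than bytes, the text is a short in-memory list, and the function names
`build_block_index` and `search` are ours, not those of Glimpse or of any
existing system.

```python
from collections import defaultdict

def build_block_index(words, block_size):
    """Map each distinct word to the sorted list of logical blocks containing it."""
    index = defaultdict(list)
    for position, word in enumerate(words):
        block = position // block_size
        postings = index[word]
        if not postings or postings[-1] != block:   # store each block only once
            postings.append(block)
    return dict(index)

def search(words, index, block_size, query):
    """Return the exact positions of `query`, scanning only the candidate blocks."""
    hits = []
    for block in index.get(query, []):
        start = block * block_size
        for position in range(start, min(start + block_size, len(words))):
            if words[position] == query:
                hits.append(position)
    return hits

text = "to be or not to be that is the question".split()
index = build_block_index(text, block_size=4)
print(index["to"], search(text, index, 4, "be"))
```

The index only records block numbers, so exact positions require a sequential
scan of each candidate block; this is the space versus time trade-off
discussed in this section.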



At this point the reader may wonder what the advantage is of pointing to
artificial blocks instead of pointing to documents (or files), thus following
the natural divisions of the text collection. If we consider the case of simple
queries (say, one word), where we are required to return only the list of
matching documents, then pointing to documents is a very adequate choice.
Moreover, as we see later, it may reduce space requirements with respect to
using blocks of the same size. Furthermore, if we pack many short documents in
a logical block, we will have to traverse the matching blocks (even for these
simple queries) to determine which documents inside the block actually matched.

However, consider the case where we are required to deliver the exact posi- 
tions which match a pattern. In this case we need to sequentially traverse the 
matching blocks or documents to find the exact positions. Moreover, in some 
types of queries such as phrases or proximity queries, the index can only tell 
that two words are in the same block, and we need to traverse it in order to 
determine if they form a phrase. 

In this case, pointing to documents of different sizes is not a good idea 
because larger documents are searched with higher probability and searching 
them costs more. In fact, the expected cost of the search is directly related 
to the variance in the size of the pointed documents. This suggests that if the 
documents have different sizes it may be a good idea to (logically) partition 
large documents into blocks and to put together small documents, such that
blocks of the same size are used. 
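The effect of size variance on the expected search cost can be illustrated
numerically. The sketch below assumes, as a simplification of ours, that a
unit of size $x$ must be traversed with probability proportional to $x$ (a
small per-word hit rate) and that traversing it costs $x$. The expected work
is then proportional to the sum of $x^2$, that is, to the variance plus the
squared mean. The document sizes and the hit rate are hypothetical.

```python
def expected_scan_cost(sizes, hit_rate):
    """Expected words scanned when a unit of size x is hit with probability hit_rate * x."""
    return sum(min(1.0, hit_rate * x) * x for x in sizes)

documents = [10, 20, 30, 940]        # hypothetical skewed document sizes, total 1000
blocks = [250, 250, 250, 250]        # same text cut into equal logical blocks
hit_rate = 1e-4                      # hypothetical per-word matching probability

document_cost = expected_scan_cost(documents, hit_rate)
block_cost = expected_scan_cost(blocks, hit_rate)
print(document_cost, block_cost)
```

With the same total text and the same number of units, the skewed collection
costs more than the equal-sized blocks, by a factor of one plus the squared
coefficient of variation of the sizes.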

In [Baeza-Yates & Navarro, 1999], we show analytically and experimentally that
using fixed size blocks it is possible to have a sublinear-size index with
sublinear search times, even for approximate word queries. A practical example
shows that the index can be $O(n^{0.94})$ in space and in retrieval time for
approximate queries with at most two errors. For exact queries the exponent
lowers to 0.85. This is a very important analytical result which is
experimentally validated and makes a very good case for the practical use of
this kind of index. Moreover, these indices are amenable to compression.
Block-addressing indices can be reduced to 10% of their original size [Bell et
al, 1993], and the first works on searching the text blocks directly in their
compressed form are just appearing [Moura et al, 1998a; Moura et al, 1998]
with very good performance in time and space.

Resorting to sequential searching to solve a query may seem unrealistic for
current Web search engine architectures, but it makes perfect sense in a near
future in which remote access could be as fast as local access. Another
practical scenario is a distributed architecture where each logical block is a
part of a Web server or a small set of locally connected Web servers sharing a
local index.

As explained before, pointing to documents instead of blocks may or may not be
convenient in terms of query times. We now analyze the space and later the time
requirements when we point to Web pages or to logical blocks of fixed size.
Recall that the document size distribution has a main body which is log-normal
(and which we approximate with a uniform distribution) and a Pareto tail.

We start by relating the free parameters of the distribution. We call $C$ the
cut point between both distributions and $f$ the fraction of documents smaller
than $C$. The integral over the Pareto tail, whose density is
$\lambda k^{\lambda} / x^{1+\lambda}$, from $C$ to infinity must then be
$(1 - f)$, which implies that $k = (1 - f)^{1/\lambda} C$. We also need to know
the value of the distribution in the uniform part, which we call $t$, and it
holds $tC = f$. For the occurrences of a word inside a document we use the
uniform distribution taking into account the size of the document.
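The relations between the parameters can be checked numerically. The sketch
below builds the density with a uniform body on $[0, C)$ and a Pareto tail
$\lambda k^{\lambda}/x^{1+\lambda}$ on $[C, \infty)$, and verifies that the
two parts carry mass $f$ and $1 - f$. The function names are ours, and the
cut point $C = 10000$ is a hypothetical value chosen only for the example;
$f = 0.93$ and $\lambda = 1.36$ are the values used for the Web in this
section.

```python
import math

def size_density(C, f, lam):
    """Return (p, k, t): uniform body on [0, C) plus Pareto tail on [C, inf)."""
    k = (1.0 - f) ** (1.0 / lam) * C    # from the tail mass (k / C)**lam = 1 - f
    t = f / C                           # from the body mass t * C = f
    def p(x):
        if x < 0:
            return 0.0
        if x < C:
            return t
        return lam * k ** lam / x ** (1.0 + lam)
    return p, k, t

def tail_mass(k, lam, a, b):
    """Closed-form integral of the Pareto density between a and b (b may be infinite)."""
    return k ** lam * (a ** -lam - b ** -lam)

p, k, t = size_density(C=10000.0, f=0.93, lam=1.36)
total_mass = t * 10000.0 + tail_mass(k, 1.36, 10000.0, math.inf)
print(k, t, total_mass)
```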

1.6.2 Space Overhead 

Since Heaps' law states that a document with $x$ words has $x^{\beta}$
different words, each new document of size $x$ added to the collection will
insert $x^{\beta}$ new references into the lists of occurrences (since each
different word of each different document has an entry in the index). Hence,
an index of $r$ blocks of size $b$ takes $O(r b^{\beta})$ space. If, on the
other hand, we consider the Web document size distribution, the average number
of new entries in the occurrence list per document is

$$\int_0^\infty p(x)\, x^{\beta}\, dx = \frac{tC^{1+\beta}}{1+\beta} + \frac{\lambda k^{\lambda}}{(\lambda-\beta)\,C^{\lambda-\beta}} \qquad (6.1)$$

where $p(x)$ was defined in Section 1.4.2.

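To determine the total size of the collection, we consider that $r$ documents
exist, whose average length is $b^*$ given by

$$b^* = \int_0^\infty p(x)\, x\, dx = \frac{tC^2}{2} + \frac{\lambda k^{\lambda}}{(\lambda-1)\,C^{\lambda-1}} \qquad (6.2)$$

and therefore the total size of the collection is

$$n = r b^* = r \left( \frac{tC^2}{2} + \frac{\lambda k^{\lambda}}{(\lambda-1)\,C^{\lambda-1}} \right) \qquad (6.3)$$

The final size of the occurrence lists is (using Eq. (6.1))

$$r \left( \frac{tC^{1+\beta}}{1+\beta} + \frac{\lambda k^{\lambda}}{(\lambda-\beta)\,C^{\lambda-\beta}} \right) \qquad (6.4)$$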

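We consider now what happens if we take the average document length and use
blocks of that fixed size (splitting long documents and putting short
documents together as explained). In this case, the size of the vocabulary is
$O(n^{\beta})$ as before, and we assume that each block is of a fixed size
$b = z b^*$. We have introduced a constant $z$ to control the size of our
blocks. In particular, if we use the same number of blocks as Web pages, then
$z = 1$. Then the size of the lists of occurrences is

$$\frac{n}{b}\, b^{\beta} = \frac{r}{z^{1-\beta}} \left( \frac{tC^2}{2} + \frac{\lambda k^{\lambda}}{(\lambda-1)\,C^{\lambda-1}} \right)^{\beta}$$

(using Eq. (6.3)). Now, if we divide the space taken by the index of documents
by the space taken by the index of blocks (using the previous equation and
Eq. (6.4)), the ratio is

$$\frac{\text{document index}}{\text{block index}} = \frac{r \left( \frac{tC^{1+\beta}}{1+\beta} + \frac{\lambda k^{\lambda}}{(\lambda-\beta)C^{\lambda-\beta}} \right)}{\frac{r}{z^{1-\beta}} \left( \frac{tC^2}{2} + \frac{\lambda k^{\lambda}}{(\lambda-1)C^{\lambda-1}} \right)^{\beta}} = z^{1-\beta}\, \frac{\frac{tC^{1+\beta}}{1+\beta} + \frac{\lambda k^{\lambda}}{(\lambda-\beta)C^{\lambda-\beta}}}{\left( \frac{tC^2}{2} + \frac{\lambda k^{\lambda}}{(\lambda-1)C^{\lambda-1}} \right)^{\beta}}$$

$$= z^{1-\beta}\, \frac{\frac{f}{1+\beta} + \frac{\lambda(1-f)}{\lambda-\beta}}{\left( \frac{f}{2} + \frac{\lambda(1-f)}{\lambda-1} \right)^{\beta}} \qquad (6.5)$$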


which is independent of $r$, $n$, $k$ and $C$; and is about 85% for $z = 1$,
$f = 0.93$ and $\beta = 0.4..0.6$. We approximated $f = 0.93$, which
corresponds to all the Web pages, because the value for textual pages is not
known. This shows that indexing documents yields an index which takes 85% of
the space of a block addressing index, if we have as many blocks as documents.
Figure 6 shows the ratio as a function of $\lambda$ and $\beta$. As can be
seen, the result varies slowly with $\beta$, while it depends more on
$\lambda$ (tending to 1 as the document size distribution is more uniform).

The fact that the ratio varies so slowly with $\beta$ is good because we
already know that the $\beta$ value is quite different for small documents. As
a curiosity, see that if the document sizes were uniformly distributed in all
the range (that is, letting $f \to 1$) the ratio would become
$2^{\beta}/(1 + \beta)$, which is close to 0.94 for intermediate $\beta$
values. On the other hand, letting $f \to 0$ (as in the simplified model
[Crovella & Bestavros, 1996]) we have a ratio near 0.83. As another curiosity,
notice that there is a $\beta$ value which gives the minimum ratio for
document versus block index (that is, the worst behavior for the block index).
This is $\beta = 0.57$ for $z = 1$, quite close to the real values (0.63 in
our Web experiments).

If we want to have the same space overhead for the document and the block
indices, we simply make the expression of Eq. (6.5) equal to 1 and obtain
$z \approx 1.27..1.48$ for $\beta = 0.4..0.6$, that is, we need to make the
blocks larger than the average of the Web pages. This translates into worse
search times. By paying more at search time we can obtain smaller indices
(letting $z$ grow over 1.48).
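The figures quoted in this subsection can be reproduced by evaluating the
final form of Eq. (6.5), namely
$z^{1-\beta}\,\bigl(f/(1+\beta) + \lambda(1-f)/(\lambda-\beta)\bigr) / \bigl(f/2 + \lambda(1-f)/(\lambda-1)\bigr)^{\beta}$.
The sketch below is a direct transcription of that expression; the function
names are ours, and $\lambda = 1.36$ is the Web value given in the caption of
Figure 6.

```python
def index_size_ratio(f, lam, beta, z=1.0):
    """Document-index size divided by block-index size, following Eq. (6.5)."""
    numerator = f / (1.0 + beta) + lam * (1.0 - f) / (lam - beta)
    denominator = (f / 2.0 + lam * (1.0 - f) / (lam - 1.0)) ** beta
    return z ** (1.0 - beta) * numerator / denominator

def break_even_z(f, lam, beta):
    """Block-size factor z for which both indices take the same space."""
    return index_size_ratio(f, lam, beta) ** (-1.0 / (1.0 - beta))

for beta in (0.4, 0.5, 0.6):
    ratio = index_size_ratio(0.93, 1.36, beta)
    print(beta, round(ratio, 3), round(break_even_z(0.93, 1.36, beta), 2))
```

Setting $f = 1$ in the same function gives the uniform-size limit
$2^{\beta}/(1+\beta)$ mentioned above.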

1.6.3 Retrieval Time 

We analyze the case of approximate queries, given that for exact queries the
result is the same by using $\nu = 0$. The probability of a given word to be
selected by a query is $O(n^{\nu-\beta})$. The probability that none of the
words in a block is selected is therefore $(1 - O(n^{\nu-\beta}))^b$. The
total amount of work of an index of fixed blocks is obtained by multiplying
the number of blocks ($r$) times the work to do per selected block ($b$) times
the probability that some word in the block is selected. This is

$$\Theta\left( r b \left( 1 - (1 - n^{\nu-\beta})^b \right) \right) = \Theta\left( n \left( 1 - e^{-\Theta(b/n^{\beta-\nu})} \right) \right) \qquad (6.6)$$

where for the last step we used that
$(1 - x)^y = e^{y \ln(1-x)} = e^{y(-x + O(x^2))} = \Theta(e^{-\Theta(yx)})$
provided $x = o(1)$.

We are interested in determining in which cases the above formula is sublinear
in $n$. Expressions of the form $1 - e^{-x}$ are $O(x)$ whenever $x = o(1)$
(since $e^{-x} = 1 - x + O(x^2)$). On the other hand, if $x = \Omega(1)$, then
$e^{-x}$ is far away from 1, and therefore $1 - e^{-x}$ is $\Omega(1)$.
Figure 6. On the left, ratio between block and document index as a function of
$\lambda$ for fixed $\beta = 0.5$ (the dashed line shows the actual $\lambda$
value for the Web). On the right, the same as a function of $\beta$ for
$\lambda = 1.36$ (the dashed lines enclose the typical $\beta$ values). In
both cases we use $f = 0.93$ and the standard $z = 1$.



For the search cost to be sublinear, it is thus necessary that
$b = o(n^{\beta-\nu})$. When this condition holds, we derive from Eq. (6.6)
that

$$\text{Time} = \Theta\left( n^{\beta} + b\, n^{1-\beta+\nu} \right) \qquad (6.7)$$
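The two regimes can be observed numerically. The sketch below evaluates the
block-traversal work $r\,b\,(1 - (1 - n^{\nu-\beta})^b)$ with all hidden
constants taken as 1, which is an assumption of ours, and compares it with the
approximation $b\,n^{1-\beta+\nu}$. The values of $n$, $b$, $\beta$ and $\nu$
are illustrative only.

```python
def block_search_work(n, b, beta, nu):
    """Expected words scanned with r = n / b blocks of size b (constants taken as 1)."""
    p_word = min(1.0, n ** (nu - beta))     # probability that a given word matches
    p_block = 1.0 - (1.0 - p_word) ** b     # probability that the block is traversed
    r = n / b
    return r * b * p_block

n, beta, nu = 1e9, 0.5, 0.18

# Small blocks, b = o(n**(beta - nu)): the work is close to b * n**(1 - beta + nu).
small = block_search_work(n, 10, beta, nu)
approximation = 10 * n ** (1 - beta + nu)

# Large blocks: almost every block is traversed and the work becomes linear in n.
large = block_search_work(n, 100000, beta, nu)
print(small / approximation, large / n)
```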
We consider now the case of an index that references Web pages. As we have
shown, if a block has size $x$ then the probability that it has to be
traversed is $(1 - e^{-\Theta(x/n^{\beta-\nu})})$. We multiply this by the
cost $x$ to traverse it and integrate over all the possible sizes, so as to
obtain its expected traversal cost (recall Eq. (6.6))

$$\int_0^\infty x \left( 1 - e^{-\Theta(x/n^{\beta-\nu})} \right) p(x)\, dx$$

which we cannot solve. However, we can separate the integral in two parts, (a)
$x = o(n^{\beta-\nu})$ and (b) $x = \Omega(n^{\beta-\nu})$. In the first case
the traversal probability is $O(x/n^{\beta-\nu})$ and in the second case it is
$\Omega(1)$. Splitting the integral in two parts and multiplying the result by
$r = n/b^*$ we obtain the total amount of work:

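$$\Theta\left( \frac{n}{b^*} \left( \left( \frac{tC^3}{3} - \frac{\lambda k^{\lambda} C^{2-\lambda}}{2-\lambda} \right) n^{-(\beta-\nu)} + \frac{\lambda k^{\lambda}}{(2-\lambda)(\lambda-1)}\, n^{-(\beta-\nu)(\lambda-1)} \right) \right)$$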

where since this is an asymptotic analysis we have considered
$C = o(n^{\beta-\nu})$, as $C$ is constant.

On the other hand, if we used blocks of fixed size, the time complexity
(using Eq. (6.7)) would be $O(b\, n^{1-\beta+\nu})$, where $b = z b^*$. The
ratio between both search times is

$$\frac{\text{doc. index traversal}}{\text{block index traversal}} = \Theta\left( n^{(\beta-\nu)(2-\lambda)} \right)$$

which shows that the document index would be asymptotically slower than a
block index as the text collection grows. In practice, the ratio is between
$O(n^{0.2})$ and $O(n^{0.4})$. The value of $z$ is not important here since it
is a constant, but notice that $k$ is usually quite large, which favors the
block index.

1.7 Concluding Remarks 

The models presented here are common to other processes related to human
behavior [Zipf, 1949] and algorithms. For example, a Zipf-like distribution
also appears for the popularity of Web pages with $\theta < 1$ [Barford et al,
1999]. On the other hand, the phenomenon of sublinear vocabulary growth is not
exclusive to natural language words. It appears as well in many other
scenarios, such as the number of different words in the vocabulary that match
a given query allowing errors as shown in Section 5, the number of states of
the deterministic automaton that recognizes a string allowing errors [Navarro,
1998], and the number of suffix tree nodes traversed to solve an approximate
query [Navarro & Baeza-Yates, 1999]. We believe that in fact the finite state
model for generating words used in Section 3 could be changed for a more
general one that could explain why this behavior is so widespread in
apparently very dissimilar processes.

By Heaps' law, more and more words appear as the text grows. Hence,
$O(\log n)$ bits are necessary in principle to distinguish among them. However,
as proved in [Moura et al, 1998], the entropy of the words of the text remains
constant. This is related to Zipf's law: the word distribution is very skewed
and therefore words can be referenced with a constant average number of bits.
This is used in [Moura et al, 1998] to prove that a Huffman code to compress
words will not degrade as the text grows, even if new words with longer and
longer codes appear. This resembles the fact that although longer and longer
words appear, their average length in the text remains constant.
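The bounded-entropy behavior can be illustrated with a small computation. The
sketch below evaluates the entropy of a Zipf distribution, with probabilities
proportional to $1/i^{\theta}$, for growing vocabulary sizes. It is only an
illustration under assumptions of ours: the exponent $\theta = 1.5$ is a
hypothetical value, the vocabulary sizes are arbitrary, and the computation is
not the proof given in [Moura et al, 1998].

```python
import math

def zipf_entropy(vocabulary_size, theta):
    """Entropy in bits of a Zipf distribution with p_i proportional to 1 / i**theta."""
    weights = [i ** -theta for i in range(1, vocabulary_size + 1)]
    total = sum(weights)
    return -sum((w / total) * math.log2(w / total) for w in weights)

sizes = [10**3, 10**4, 10**5]
entropies = [zipf_entropy(v, theta=1.5) for v in sizes]
for v, h in zip(sizes, entropies):
    # log2(v) is the cost of a fixed-length code for v distinct words
    print(v, round(math.log2(v), 2), round(h, 2))
```

While the fixed-length cost grows with the logarithm of the vocabulary size,
the entropy increases by smaller and smaller amounts and stays bounded.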

Regarding the number of answers of other types of queries, like prefix
searching, regular expressions and other multiple-matching queries, we
conjecture that the set of matching words also grows as $O(n^{\nu})$ if the
query is going to be useful in terms of precision. This issue is being
considered for future work.

With respect to our analysis of inverted files for the Web, our results say
that using blocks we can reduce the space requirements by increasing slightly
the retrieval time, keeping both of them sublinear. Fine tuning of these ideas
is a matter of further study. On the other hand, the fact that the average Web
page size remains constant even while the Web grows shows that sublinear space
is not possible unless block addressing is used. Hence, future work includes
the design of distributed architectures for search engines that can use these
ideas.

Finally, as it is very difficult to do meaningful experiments on the Web, we
believe that careful modeling of Web page statistics may help in the final
design of search engines. This can be done not only for inverted files, but
also for more difficult design problems, such as techniques for evaluating
Boolean operations in large answers and the design of distributed search
architectures, where Web traffic and caching become an issue as well.
Appendix

Deducing the Heaps' Law 

We show now that Heaps' law can be deduced from the simple finite state model
mentioned before. Let us assume that a person hits the space bar with
probability $(1 - p)$ and any other letter (uniformly distributed over an
alphabet of size $\sigma$) with probability $p$, without hitting the space bar
twice in a row (see Figure A.1).

Figure A.1. Simple finite-state model for generating words.

Since there are no words of length zero, the probability that a produced word
is of length $\ell$ is $p^{\ell-1}(1 - p)$, since we have a geometric
distribution. The expected word length is $1/(1 - p)$, from where $p = 0.84$
can be approximated since the average word length is close to 6.3 as shown
later, for text without stopwords. For this case, we use $\sigma = 15$, which
would be the equivalent number of letters for text generated using a uniformly
distributed alphabet.

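On average, if $n$ words are written, $p^{\ell-1}(1 - p)n$ of them are of
length $\ell$. We count now how many of these are different, considering only
those of length $\ell$. Each of the $\sigma^{\ell}$ strings of length $\ell$
is different from each written word of length $\ell$ with probability
$(1 - 1/\sigma^{\ell})$, and therefore it is never written in the whole
process with probability

$$\left( 1 - \frac{1}{\sigma^{\ell}} \right)^{p^{\ell-1}(1-p)n} = e^{-\frac{p^{\ell-1}(1-p)n}{\sigma^{\ell}} \left( 1 + O(1/\sigma^{\ell}) \right)}$$

from where we obtain that the total number of different words of length $\ell$
that are written is

$$\sigma^{\ell} \left( 1 - e^{-\frac{p^{\ell-1}(1-p)n}{\sigma^{\ell}} \left( 1 + O(1/\sigma^{\ell}) \right)} \right)$$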

Now we consider two possible cases

(a) $x = p^{\ell-1}(1-p)n/\sigma^{\ell} = o(1)$

The condition is equivalent to $\ell = \omega(L)$, where
$L = \ln((1 - p)n/p)/\ln(\sigma/p)$, i.e. large $\ell$. In this case,
$e^{-x} = 1 - x + o(x)$, and hence the number of strings is

$$\sigma^{\ell}\, \frac{p^{\ell-1}(1-p)n}{\sigma^{\ell}} \left( 1 + O(1/\sigma^{\ell}) \right) = p^{\ell-1}(1-p)n \left( 1 + O(1/\sigma^{\ell}) \right)$$

that is, basically all the written words are different.

(b) $x = p^{\ell-1}(1-p)n/\sigma^{\ell} = \Omega(1)$

In this case, $e^{-x}$ is far away from 1, and therefore
$\sigma^{\ell}(1 - e^{-x}) = \Theta(\sigma^{\ell})$. That is, $\ell$ is small
and all the different words are generated.

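We sum now all the different words of each possible length generated,

$$\sum_{\ell=1}^{\lfloor L \rfloor} \sigma^{\ell} + \sum_{\ell=\lfloor L+1 \rfloor}^{\infty} p^{\ell-1}(1-p)n$$

and obtain that both summations are

$$\Theta\left( \left( \frac{(1-p)n}{p} \right)^{1/(1+\log_{\sigma}(1/p))} \right)$$

which is of the form $O(n^{\beta})$.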
The value obtained with $p = 0.84$ and $\sigma = 15$ is $n^{0.94}$, which is
much higher than reality. Consider, however, that it is unrealistic to assume
that all the 15 or 26 letters are equally probable and to ignore the
dependencies among consecutive letters. In fact, not all possible combinations
of letters are valid words. Even in this unfavorable case, we have shown that
the number of different words follows Heaps' law. More accurate models should
yield the empirically observed values between 0.4 and 0.6.

Deducing the Zipf's Law

We show now that Zipf's law can also be deduced from the same model. From the
previous Heaps' result, we know that if we consider words of length
$\ell = O(L)$ then all the different $\sigma^{\ell}$ combinations appear,
while if $\ell = \omega(L)$ then all the $p^{\ell-1}(1 - p)n$ words generated
are basically different.

Since shorter words are more probable than longer words, we know that, if we
sort the vocabulary by frequency (from most to least frequent), all the words
of length smaller than $\ell$ will appear before those of length $\ell$.

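In the case $\ell = O(L)$, the number of different words shorter than $\ell$
is

$$\sum_{i=1}^{\ell-1} \sigma^{i} = \Theta(\sigma^{\ell})$$

while, on the other hand, if $\ell = \omega(L)$, the summation is split in all
those smaller than $L$ and those between $L$ and $\ell$:

$$\sum_{i=1}^{\lfloor L \rfloor} \sigma^{i} + \sum_{i=\lfloor L+1 \rfloor}^{\ell-1} p^{i-1}(1-p)n = \Theta\left( \sigma^{L+1} + n \left( p^{L} - p^{\ell-1} \right) \right)$$

which, since $L = \ln((1 - p)n/p)/\ln(\sigma/p)$, is
$\Theta\left( ((1 - p)n/p)^{1/(1+\log_{\sigma}(1/p))} \right)$.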

We relate now the result with Zipf's law. In the case of small $\ell$, we have
that the rank of the first word of length $\ell$ is
$i = \Theta(\sigma^{\ell})$. We also know that, since all the $\sigma^{\ell}$
different words of length $\ell$ appear, they are uniformly distributed, and
$p^{\ell-1}(1 - p)n$ words of length $\ell$ are written, the number of times
each different word appears is

$$\frac{p^{\ell-1}(1-p)n}{\sigma^{\ell}} = \frac{(1-p)n/p}{(\sigma/p)^{\ell}} = \frac{(1-p)n/p}{(\sigma^{\ell})^{\log_{\sigma}(\sigma/p)}} = \frac{(1-p)n/p}{i^{\,1+\log_{\sigma}(1/p)}}$$

which, in the light of Zipf's law, shows that
$\theta = 1 + \log_{\sigma}(1/p)$.

We consider the case of large $\ell$ now. As said, basically every typed word
of this length is different, and therefore its frequency is 1. Since this must
be $\Theta(n/i^{\theta})$, we have

$$\Theta(n) = i^{\theta} = \left( (1-p)n/p \right)^{\theta/(1+\log_{\sigma}(1/p))}$$

where the last step considered that, as found before, the rank $i$ of this
word is $((1 - p)n/p)^{1/(1+\log_{\sigma}(1/p))}$. Equating the first and last
term yields again $\theta = 1 + \log_{\sigma}(1/p)$.
Hence, the finite state model implies Zipf's law; moreover, the $\theta$ value
found is precisely $1/\beta$, where $\beta$ is the value for Heaps' law. As we
have shown, this relation must hold when both rules are valid. The numerical
value we obtain for $\theta$ assuming $p = 0.84$ and a uniform model over 15
letters is $\theta = 1.06$, which is also far from reality but is close to the
Mandelbrot distribution fitting obtained by Miller et al [Miller, Newman &
Friedman, 1957] (they use $p = 0.82$). Note also that the development of Li
[Li, 1992] is similar to ours regarding Zipf's law, although he uses different
techniques and argues that this law appears because the frequency rank is used
as independent variable. However, we have been able to relate $\beta$ and
$\theta$.
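The numerical values of this appendix can be checked with a short computation.
The sketch below evaluates $\theta = 1 + \log_{\sigma}(1/p)$ and
$\beta = 1/\theta$, and also sums the expected number of different words over
all word lengths, $\sum_{\ell} \sigma^{\ell}\,(1 - (1 - 1/\sigma^{\ell})^{p^{\ell-1}(1-p)n})$,
to estimate the growth exponent empirically. The function names, the
truncation of the sum at length 200 and the two text sizes are choices of ours
made only for the example.

```python
import math

def typing_model_exponents(p, sigma):
    """Heaps' beta and Zipf's theta implied by the finite-state typing model."""
    theta = 1.0 + math.log(1.0 / p, sigma)
    beta = 1.0 / theta
    return beta, theta

def expected_vocabulary(n, p, sigma, max_length=200):
    """Expected number of distinct words after n written words, summed over lengths."""
    total = 0.0
    for length in range(1, max_length + 1):
        strings = float(sigma) ** length
        written = p ** (length - 1) * (1.0 - p) * n
        # a given string is never written with probability (1 - 1/strings)**written;
        # log1p and expm1 keep the computation accurate when 1/strings is tiny
        total += strings * -math.expm1(written * math.log1p(-1.0 / strings))
    return total

beta, theta = typing_model_exponents(0.84, 15)
v_small = expected_vocabulary(1e6, 0.84, 15)
v_large = expected_vocabulary(1e8, 0.84, 15)
slope = math.log(v_large / v_small) / math.log(100.0)
print(round(beta, 3), round(theta, 3), round(slope, 3))
```

The empirical slope between the two sizes should be near the predicted
$\beta$, although the discrete word lengths make the local slope oscillate
slightly with $n$.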
Abstract In this paper, we partially review probabilistic and time series
models in finance. Both discrete and continuous-time models are described. The
characterization of the No-Arbitrage paradigm is extensively studied in
several financial market contexts. As the probabilistic models become more and
more complex to be realistic, the Econometrics needed to estimate them becomes
more difficult. Consequently, there is still much research to be done on the
link between probabilistic and time series models.

Keywords: Asset Pricing, CAPM, Choquet integral, Diffusion process, GARCH, Stochastic 
Volatility, Term Structure, Value at Risk. 

2.1 Introduction 

Uncertainty plays a central role in financial theory and its empirical imple- 
mentation. The objective of this paper is to review the connection between the 
theory and the empirical analysis in the area of Finance. It is obvious that the 
scope of the subject is too wide and, consequently, we will not be able to cover 
all contributions in the area. Therefore, in the framework of probabilistic mod- 
els, we focus on those pricing models reflecting the absence of arbitrage and 
free-lunch. The problem of valuation and hedging of contingent claims (risks) 
presents important difficulties when market imperfections are met. The
characterization of No-Arbitrage (NA) is extensively studied in section 2.
Pricing of contingent claims when markets are subject to portfolio
constraints, transaction costs and taxes, as well as new results for nonlinear
pricing along with a universal framework for pricing financial and insurance
risks, are reviewed in this section.

Section 3 reviews the main time series models devoted to the analysis of
financial returns. We start by describing models for the conditional mean
usually fitted to test whether financial prices are predictable. In this
sense, it is generally accepted that asset returns are close to being
martingale difference processes. However, they are not independent because of
the often observed dependence of some transformations related to second
moments. Consequently, we then describe models to represent the dynamic
evolution of conditional variances and covariances of high frequency returns.
Finally, section 3 reviews the models recently proposed to represent the main
empirical properties of ultra high frequency (intra-daily) returns.

In section 4, we focus on the link between probabilistic models and Financial
Econometrics. We show that the estimation of realistic financial models for
asset prices is, in general, difficult and much research remains to be done in
this area. In particular, in this section, we describe the empirical
implementation of the CAPM as well as the estimation procedures of the term
structure, the VaR and continuous time diffusions.

The paper finishes in section 5 with a summary of the main conclusions. 

2.2 Probabilistic models for finance 

A classical problem in mathematical finance is the pricing of financial
assets. The usual solution of this problem involves the so-called Fundamental
Theorem of Asset Pricing. This result ensures that, in a perfect financial
market, the assumption of NA is essentially equivalent to the existence of an
equivalent martingale measure. The NA assumption amounts to saying that there
is no plan yielding some profit without a countervailing threat of loss. It
prevents the existence of zero cost portfolios with positive return. The
problem of fair pricing of financial assets is then reduced to taking their
expected values with respect to equivalent martingale measures. Initial
results on the Fundamental Theorem of Asset Pricing hold in the case of a
finite number of assets and finite discrete time models; see Harrison and
Kreps (1979) and Harrison and Pliska (1981).
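The simplest finite instance of this equivalence is the one-period market with
one bond and one risky asset that moves to one of two states. The sketch below
is a textbook illustration rather than a result of the papers cited: the
market is arbitrage free exactly when the gross riskless return lies strictly
between the down and up factors, and in that case prices are discounted
expectations under the unique equivalent martingale probability. The function
names and the numerical parameters are hypothetical.

```python
def risk_neutral_probability(up, down, rate):
    """Equivalent martingale probability of the up move, or None if arbitrage exists."""
    if not (down < 1.0 + rate < up):
        return None
    return (1.0 + rate - down) / (up - down)

def price(payoff_up, payoff_down, up, down, rate):
    """Discounted expected payoff under the equivalent martingale measure."""
    q = risk_neutral_probability(up, down, rate)
    if q is None:
        raise ValueError("market admits arbitrage; no equivalent martingale measure")
    return (q * payoff_up + (1.0 - q) * payoff_down) / (1.0 + rate)

# Hypothetical market: stock at 100 moves to 120 or 90, riskless rate 5%.
q = risk_neutral_probability(1.2, 0.9, 0.05)
stock_price = price(120.0, 90.0, 1.2, 0.9, 0.05)   # the stock prices itself at 100
call_price = price(20.0, 0.0, 1.2, 0.9, 0.05)      # call option with strike 100
print(q, stock_price, call_price)
```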

Various generalizations are now available in the literature. For discrete
infinite or continuous time, the notion of "no free lunch" or "no free lunch
with bounded (vanishing) risk" is needed, which is a slightly stronger version
of the non-arbitrage condition; see, for example, Dalang et al. (1989), Back
and Pliska (1991) and Schachermayer (1992). In these generalizations,
securities markets are assumed to be frictionless, i.e. without transaction
costs. For the discrete infinite case see Schachermayer (1994). For continuous
time models see Delbaen (1992) or Delbaen and Schachermayer (1994, 1998);
see also Duffie and Huang (1986), Striker (1990) and Kabanov and Kramkov
(1994).

2.2.1 The Fundamental Theorem of Asset Pricing 

The mathematical translation of this concept uses martingale theory and
stochastic analysis. Under the assumption that the $\mathbb{R}$-valued price
process $\{S_t\}_{t \in \mathbb{R}_+}$ reflects economically meaningful ideas
and does not generate arbitrage profits, the Fundamental Theorem of Asset
Pricing allows the probability $P$ on the underlying probability space
$(\Omega, \mathcal{F}, P)$ to be replaced by an equivalent measure $Q$ such
that $\{S_t\}_{t \in \mathbb{R}_+}$ becomes a (local) martingale under the new
measure. The information structure is given by a filtration
$(\mathcal{F}_t)_{t \in T}$. Following Delbaen and Schachermayer (1994, 1998),
there should be no trading strategy $H$ for the process $S$ such that the
final payoff, described by the stochastic integral $(H \cdot S)_\infty$, is a
nonnegative function, strictly positive with positive probability.

A buy-and-hold strategy can be described, from the mathematical point of view,
as an integrand of the form $H = f\,\mathbf{1}_{(T_1, T_2]}$, where
$T_1 \le T_2$ are stopping times and $f$ is $\mathcal{F}_{T_1}$-measurable.
The interpretation of these integrands is clear: when time $T_1(\omega)$ comes
up, buy $f(\omega)$ units of the financial asset, keep them until time
$T_2(\omega)$ and sell. Stopping times are interpreted as signals coming from
available information and this is one reason why, in mathematical finance, the
filtration and further concepts such as predictable processes are so relevant.
Even if the process $S$ is not a semi-martingale, the stochastic integral
$(H \cdot S)$ for a buy-and-hold strategy $H$ can be defined as the process
$(H \cdot S)_t = f\,(S_{\min(t,T_2)} - S_{\min(t,T_1)})$. A linear combination
of buy-and-hold strategies is called a simple integrand. In the general case
simple integrands are not sufficient to characterize those processes that
admit an equivalent martingale measure. On the other hand, the use of general
integrands leads to the problem of the existence of $(H \cdot S)$. The
so-called admissible integrands avoid all of these pathologies.
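On a single discretely observed price path, the gain process of a buy-and-hold
strategy, $f\,(S_{\min(t,T_2)} - S_{\min(t,T_1)})$, reduces to elementary
arithmetic. The sketch below is an illustration under simplifying assumptions
of ours: time is a list index, the stopping times are given as fixed indices
for this one path rather than as random variables, and the prices are
hypothetical.

```python
def buy_and_hold_gain(path, units, t_buy, t_sell):
    """Gain process units * (S_min(t, t_sell) - S_min(t, t_buy)) along one price path."""
    if not 0 <= t_buy <= t_sell < len(path):
        raise ValueError("need 0 <= t_buy <= t_sell < len(path)")
    return [units * (path[min(t, t_sell)] - path[min(t, t_buy)])
            for t in range(len(path))]

prices = [100, 102, 101, 105, 103]          # hypothetical observed prices
gains = buy_and_hold_gain(prices, units=2, t_buy=1, t_sell=3)
print(gains)
```

The gain is zero before the purchase, follows the price while the position is
held, and stays frozen after the sale.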

Formally, if $S$ denotes an $\mathbb{R}$-valued semi-martingale, defined on
the filtered probability space
$(\Omega, \{\mathcal{F}_t\}_{t \in \mathbb{R}_+}, P)$, an $\mathbb{R}$-valued
predictable process $H$ is called $a$-admissible if it is $S$-integrable, if
$H_0 = 0$, if the stochastic integral satisfies $H \cdot S \ge -a$ and if
$\lim_{t \to \infty} (H \cdot S)_t$ exists a.s. If $H$ is admissible for some
$a$, then it is simply called admissible.

In order to characterize mathematically the NA and the No Free Lunch (NFL)
properties, we need to consider the following vector spaces. Let us denote by
$L^0$ the vector space of all real-valued measurable functions defined on
$\Omega$. Endowed with the topology of convergence in probability, this space
becomes a Fréchet space (i.e. a complete and metrisable vector space).
$L^\infty$ denotes the subspace of $L^0$ of all bounded functions. It is
remarkable that the two spaces $L^0$ and $L^\infty$ are, among the $L^p$
spaces, the only two spaces that remain the same when the original probability
measure is replaced by an equivalent one. Let us introduce the following sets:

$\Phi = \{ (H \cdot S)_\infty \mid H \text{ is admissible} \}$,

$\Phi_a = \{ (H \cdot S)_\infty \mid H \text{ is } a\text{-admissible} \}$,

$T_0 = \Phi - L^0_+$,

$T = T_0 \cap L^\infty$.
In all papers dealing with the Fundamental Theorem of Asset Pricing (with
simple integrands), the assumption of NA or NFL essentially amounts to saying
that the set $\Phi$ does not contain any non-negative random variable except
the null one.

Formally, we say that the process $S$ satisfies the NA property if:

$\Phi \cap L^0_+ = \{0\}$ (2.1)

which is equivalent to the expression

$T \cap L^\infty_+ = \{0\}$.

The process $S$ satisfies the NFL property if

$\overline{T} \cap L^\infty_+ = \{0\}$, (2.2)

where the bar denotes closure in the norm topology of $L^\infty$.

The NFL is an old expression used in the early days of the finance literature.
The NA postulates that the set of random variables which can be achieved by a
zero cost portfolio does not include any positive random variable. The NFL
condition postulates the same on the topological closure of the previous set.
The following technical definition is due to Kreps (1981). Let $S$ be a
bounded process and let us denote by $\Phi^*$ the set of all outcomes with
respect to bounded simple integrands. $T^*$ is defined in the same way,
$T^* = (\Phi^* - L^0_+) \cap L^\infty$.

Then, an adapted process $S$ satisfies the NFL property, as above, if the
corresponding set of outcomes does not contain any non-negative random
variable except the null one, $\widetilde{T^*} \cap L^\infty_+ = \{0\}$, where
the tilde denotes weak closure. Dealing with the weak closure it may happen
that an element of this set can only be obtained by an unbounded generalized
sequence. Unfortunately the economic interpretation of these unbounded objects
is unclear. However, the requirements of NA and NFL in expressions (2.1) and
(2.2) are very strong. We assume that $S$ is a semi-martingale and there is an
equivalent martingale measure for the process $S$. On the other hand we need a
definition for the set of outcomes with respect to general admissible
integrands. The following theorem from Delbaen and Schachermayer (1998)
characterizes the NFL concept through a boundedness property in $L^0$.
THEOREM 1 The process $S$ satisfies the NFL property (2.2) if and only if it
satisfies

1 the NA property (2.1) and

2 $\Phi_1$ is bounded in the space $L^0$.

They remark that the boundedness of the set $\Phi_1$ has the following
economic interpretation: for outcomes that have a maximal loss bounded by 1,
the profit is bounded in probability; this means that the probability of
making a big profit can be estimated from above, uniformly over all such
outcomes.

For further characterization of the NFL property and related results for lo- 
cally bounded semi-martingales S, see Delbaen and Schachermayer (1994, 
1998). 

A recent projective system approach to the martingale characterization of 
the absence of arbitrage is provided by Balbas et al. (2002). The equivalence 
between the absence of arbitrage and the existence of an equivalent martingale 
measure fails when an infinite number of trading dates is considered. Thus, 
enlarging the set of states of nature and the probability measure through a pro- 
jective system of perfect measure space, the authors characterize the absence 
of arbitrage when the time set is countable. 

The martingale characterization can be extended in the context of imperfect 
financial models, mainly financial models with proportional transaction costs, 
short sale constraints, convex cone constraints, etc. 

We can observe three main lines of research generalizing these initial results. 
The first one applies in the context of imperfect financial markets for a model 
with transaction costs. The second line of research extends the results to
restricted feasible portfolios, usually under cone constraints. The third and
most recent research direction is based on the assumption that the price is
non-linear with respect to the portfolio. The subadditivity property is then
needed, and the Choquet integral is a powerful tool in this context. The asset
pricing problem is then solved as a Choquet integral of the future returns with
respect to a new capacity introduced by Chateauneuf et al. (1994, 1996).

Currently there is a pressing need for a universal framework for the determi- 
nation of the fair value of financial and insurance risks. In the financial services 
industry, this pressing need is evidenced by the recent Basel Accords on regu- 
latory risk management that require fair value, analogous to market prices, to 
be applied to all assets or losses, whether traded or not. More recently Wang 
(2000, 2001) presents a universal framework for pricing financial and insur- 
ance risks. 
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2.2.2 Asset Pricing in Imperfect Financial Markets 

In the classical setting, the financial market is modeled in a "frictionless" 
way which is a clear idealization of the real world. Therefore models with 
transaction costs have been increasingly studied in the literature; see Davis 
and Norman (1990) or Stricker (1990). Jouini and Kallal (1995a) characterize
the assumption of NFL in a model with transaction costs and give fair pricing
intervals for contingent claims in such a model. As for other imperfections,
Jouini and Kallal (1995b, 1999) consider the case of short sale constraints
or shortselling costs, with possibly different borrowing and lending
rates. The problem of hedging contingent claims in continuous time is studied
by Cvitanic and Karatzas (1996). They propose a diffusion model (with one
bond and one risky asset) with proportional transaction costs, and give a dual 
formulation for the so-called super-replication price of a contingent claim (i.e. 
the minimum initial wealth needed to hedge the contingent claim, or in other 
words, to obtain, through the investment opportunities available on the market, 
at least the contingent claim). Delbaen et al. (1998) generalize this result 
to the multivariate case, in discrete as well as in continuous time, and with a 
semi-martingale price process. In these models too, typically there is a "bond" 
which serves as numeraire asset. The usual assumption is that, at final date T, 
all the positions in the other traded assets are liquidated, i.e., converted into 
units of the bond. 

More recently, Jouini and Napp (2002) generalize existing results in the
following ways. First, they do not assume that there exists a numeraire available
to investors that allows them to transfer money from one date to another; this
makes it possible to consider any type of friction on the numeraire, such as no
borrowing, different borrowing and lending rates, bonds with default risk, etc.
This setting also takes into account the fact that not all investors are equal
with regard to borrowing and lending: some investors may enjoy special borrowing
facilities while others may not. Second, they are led to introduce a new notion
of NFL, which coincides with the classical concept in finite time but does not
exclude a free lunch at infinity, and may therefore be more economically
meaningful. Last, they characterize the NFL assumption for very general
investments, which makes it possible to consider investment opportunities that
are not necessarily related to a market model, to generalize the results
obtained for imperfect markets, and to obtain them all in a unified way.
Technically, all investment opportunities are described in terms of cash flows.
Therefore, separation techniques in more complex spaces are needed to obtain the
Fundamental Theorem of Asset Pricing. Let us consider their main Assumption A.

DEFINITION 2 An investment is an $(F_t)_{t \in T}$-adapted process $H = (H_t)_{t \in T}$,
null outside a finite number of dates, i.e. there exists $(t_1, \ldots, t_N)$ such that
$H_t = 0$ for all $t \notin \{t_i\}_{i=1}^{N}$, and such that $H_t$ is in $L^1(\Omega, F_t, P)$ for all $t \in T$.

DEFINITION 3 (Assumption A) There exists a sequence $d = (d_n)_{n \ge 0}$
such that for all $t^* \ge 0$, for all $B_{t^*}$ in $F_{t^*}$ of positive probability, there exists
$H$ in the convex cone of investment opportunities $J$, of the form $H = 0$ outside
$B_{t^*}$, $H_t = 0$ for all $t < t^*$, $H_t \ge 0$ for all $t > t^*$, and there exists $d_n \in d$ with
$P[H_{d_n} > 0] > 0$.

Roughly, Assumption A corresponds to the possibility of transferring "some 
money" from any date and event to some particular date. This assumption is 
not too restrictive: it is satisfied if we can buy at every date and event a bond 
with a given maturity even if this bond is defaultable and even if there is no 
secondary market for that bond (i.e. we have to wait until maturity in order to 
recover any money with a positive probability, which may be different from 1); 
this includes market models with frictions on the numeraire like no borrowing, 
different borrowing and lending rates, bonds with default risk, or different
borrowing facilities among the investors. More generally, it is satisfied if there is
at least one asset whose price cannot be negative (which is usually the case for
stocks, options, defaultable bonds, etc.).

Then a characterization of the NFL property in a model with flows is given
by Jouini and Napp (2002) in the following theorem.

THEOREM 4 Let $J$ denote a convex cone of investments satisfying Assumption A.
There is NFL for $J$ if and only if there exists a process $g = (g_t)_{t \in T}$
satisfying, for all $t$ in $T$, $P[0 < g_t < M] = 1$ for some $M$ in $\mathbb{R}_+$, and such that

$$E\Big[ \sum_{t \in T} g_t H_t \Big] \le 0 \quad \text{for all } H = (H_t)_{t \in T} \in J.$$

Moreover, the process $g$ can be taken $(F_t)_{t \in T}$-adapted.

In other words, there is NFL for a convex cone of available investments 
satisfying Assumption A if and only if a given convex set of "admissible" dis- 
count processes is non-void. The theorem ensures the existence of a "discount 
process" such that, using this process as deflator, all available investments have 
non-positive present value; this means that there exists a term structure such 
that the market consisting of the primitive investment opportunities and of the 
additional borrowing and lending facilities is still "arbitrage-free". Besides,
the existence of such a discount process rules out any arbitrage opportunity.
Notice that Assumption A is not needed to obtain this result if the set of
investment opportunities is related to a countable set of dates.

Since most market models with frictions can fit in the model with flows for 
a specific convex cone of available investments, the model in Jouini and Napp 
(2002) provides a unified framework for the study of the characterization of 
the absence of FL in such imperfect market models. However, this model with
flows does not cover economies with fixed transaction costs, since in that case
the set of available investments is not a cone.

Kabanov (1999, 2001) develops a mathematical theory of currency markets 
with transaction costs based on ideas of convex geometry. He proposed an 
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appealing framework to model financial markets in a numeraire-free way for 
both frictionless markets and markets with transaction costs. This approach
turns out to be conceptually interesting, even in the frictionless case, as it
allows for a new look at the wealth processes arising in financial modelling
without explicitly using stochastic integration: expressing portfolios in terms
of the number of physical units of the assets, as opposed to the values of the
assets in terms of some numeraire, opens new perspectives. Basically, the
financial market is modelled by a $d \times d$ matrix-valued stochastic process
specifying the mutual bid and ask prices between the $d$ assets. The terms of
trade at time $t$ are modelled via an $F_t$-measurable non-negative $d \times d$
matrix-valued map $\omega \mapsto \Sigma_t(\omega)$ denoting the bid and ask prices for the
exchange between the $d$ assets. The entry $\sigma_t^{ij}$ of $\Sigma_t$ denotes the number of
units of asset $i$ for which an agent can trade in one unit of asset $j$, i.e. the
price of asset $j$ in terms of asset $i$. Bid-ask processes are defined as adapted
processes taking values in the set of bid-ask matrices. In the frictionless
case, $\sigma_t^{ij} \sigma_t^{ji} = 1$ a.s. for all $1 \le i, j \le d$ and $t = 0, \ldots, T$.

Kabanov et al. (2001) introduce the bid-ask process in a somewhat indirect
way. They start with a $d$-dimensional price process $(S_t)_{t=0}^{T}$ which models the
prices of the $d$ assets without transaction costs in terms of some numeraire (it
may be a traded asset or not). One then defines a non-negative $d \times d$ matrix
$\Lambda = (\lambda^{ij})_{1 \le i, j \le d}$ of transaction cost coefficients $\lambda^{ij}$, modelling the
proportional factor one has to pay in transaction costs when exchanging the
$i$'th into the $j$'th asset. Then the bid-ask process is obtained as

$$\Sigma_t(\omega) = \mathrm{Diag}(S_t(\omega))^{-1} \, (\mathbf{1} + \Lambda_t(\omega)) \, \mathrm{Diag}(S_t(\omega)), \tag{2.3}$$

where $\mathbf{1}$ denotes the unit matrix, i.e. the matrix with all entries equal to 1
(not to be confused with the identity matrix).
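
As a small numerical illustration of (2.3), the following sketch builds a bid-ask matrix from a price vector and a matrix of proportional cost coefficients. The prices, the common cost rate of 1% and the function names are assumptions chosen only for illustration; they do not come from Kabanov et al. (2001).

```python
def bid_ask_matrix(prices, costs):
    """Entry [i][j] is (1 + costs[i][j]) * prices[j] / prices[i], i.e. the
    (i, j) entry of Diag(S)^-1 (1 + Lambda) Diag(S) in equation (2.3)."""
    d = len(prices)
    return [[(1.0 + costs[i][j]) * prices[j] / prices[i] for j in range(d)]
            for i in range(d)]


def proportional_costs(d, rate):
    """Hypothetical cost matrix: one common rate off the diagonal, zero on it."""
    return [[0.0 if i == j else rate for j in range(d)] for i in range(d)]


# Illustrative frictionless prices of three assets in some numeraire.
prices = [1.0, 2.0, 4.0]
sigma = bid_ask_matrix(prices, proportional_costs(3, 0.01))
frictionless = bid_ask_matrix(prices, proportional_costs(3, 0.0))

# Units of asset 0 needed to end up with one unit of asset 0 after passing
# through asset 1: (1 + rate)^2 with friction, exactly 1 without.
round_trip = sigma[0][1] * sigma[1][0]
round_trip_frictionless = frictionless[0][1] * frictionless[1][0]
print(round_trip, round_trip_frictionless)
```

With zero costs the round-trip product equals one, which is the frictionless case described above; any positive cost rate pushes it above one.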

Schachermayer (2002) presents a direct model of the bid-ask process
$\Sigma = (\Sigma_t)_{t=0}^{T}$, without first defining $(S_t)_{t=0}^{T}$ and $\Lambda$. This seems more natural
from an economic point of view, as in a market with friction an agent is
certainly faced with a bid and an ask price, but these prices are not
necessarily decomposed into a "frictionless" price and additional transaction costs.

The notion of consistent (resp. strictly consistent) price system introduced
by Kabanov and his co-authors extends the notion of equivalent martingale
measures. Similar notions are used in Schachermayer (2002).

DEFINITION 5 An adapted $\mathbb{R}^d_+$-valued process $Z = (Z_t)_{t=0}^{T}$ is called a
consistent (resp. strictly consistent) price process for the bid-ask process $\Sigma$ if $Z$
is a martingale under $P$ and $Z_t(\omega)$ lies in $K_t^*(\omega) \setminus \{0\}$ (resp. in the relative
interior of $K_t^*(\omega)$) a.s., for each $t = 0, \ldots, T$.

Here $K_t^*(\omega)$ stands for $K^*(\Sigma_t(\omega))$, where
$K^*(\Sigma) = \{ w \in \mathbb{R}^d : \langle v, w \rangle \ge 0 \text{ for } v \in K(\Sigma) \}$ is the polar
of $-K(\Sigma)$, and $K(\Sigma)$ is the solvency cone, i.e., the convex cone in $\mathbb{R}^d$ spanned
by the unit vectors $e^i$, $1 \le i \le d$, and the vectors $\sigma^{ij} e^i - e^j$, $1 \le i, j \le d$.

The cone $K^*(\Sigma)$ has a nice economic interpretation, alluded to by the term
"consistent price system". A vector $w$ is in $K^*(\Sigma)$ if it defines a frictionless
pricing system for the assets $1, \ldots, d$ which is consistent with the bid-ask
matrix $\Sigma$ in the following sense: if the price of asset $i$ (denoted in terms of
some numeraire) equals $w^i$, then the frictionless exchange rates, denoted by
$\tau^{ij}$, clearly equal

$$\tau^{ij} = \frac{w^j}{w^i}, \quad 1 \le i, j \le d.$$

From the economic point of view, a consistent price system $w = (w^i)_{i=1}^{d}$
is strictly consistent if, for all $1 \le i, j \le d$, the exchange rate $\tau^{ij} = w^j / w^i$ is in
the relative interior of the bid-ask spread $[1/\sigma^{ji}, \sigma^{ij}]$.
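
The bid-ask spread condition above can be checked directly for a given price vector. The sketch below is only an illustration under stated assumptions: the two-asset bid-ask matrix, the tolerance and the function name `is_consistent` are hypothetical, and a degenerate (single-point) spread is treated as its own relative interior.

```python
def is_consistent(w, sigma, strict=False, tol=1e-12):
    """Check that every exchange rate w[j] / w[i] lies in the bid-ask spread
    [1 / sigma[j][i], sigma[i][j]]; with strict=True, in its relative interior."""
    d = len(w)
    # A non-zero consistent price vector must have strictly positive entries.
    if any(x <= 0.0 for x in w):
        return False
    for i in range(d):
        for j in range(d):
            if i == j:
                continue
            lower, upper = 1.0 / sigma[j][i], sigma[i][j]
            rate = w[j] / w[i]
            if rate < lower - tol or rate > upper + tol:
                return False
            degenerate = upper - lower <= tol  # frictionless pair: single point
            on_boundary = rate <= lower + tol or rate >= upper - tol
            if strict and not degenerate and on_boundary:
                return False
    return True


# Hypothetical bid-ask matrix: prices (1, 2) with a 1% proportional cost.
sigma = [[1.0, 2.02],
         [0.505, 1.0]]

print(is_consistent([1.0, 2.0], sigma, strict=True))    # interior of the spread
print(is_consistent([1.0, 2.02], sigma))                # on the boundary
print(is_consistent([1.0, 2.02], sigma, strict=True))
print(is_consistent([1.0, 2.1], sigma))                 # outside the spread
```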

The main theorem in Kabanov et al. (2001) is the following version of the
Fundamental Theorem of Asset Pricing: under an additional assumption, a
bid-ask process $\Sigma$ satisfies the strict NA condition if and only if there is a
strictly consistent price system $Z$ for $\Sigma$. The additional assumption is called
"efficient friction" and requires that $K_t(\omega) \cap (-K_t(\omega)) = \{0\}$ a.s., for all
$t = 0, \ldots, T$. These authors asked whether this additional assumption can be
dropped. Schachermayer (2002) gives an example of a bid-ask process $\Sigma$, with
$d = 5$ and $T = 2$, showing that, in general, the answer to this question is no.
In the same paper a slight strengthening of the notion of strict NA, called
robust no arbitrage ($NA^r$), is introduced, and a corresponding Fundamental
Theorem of Asset Pricing is then formulated as the main result.



2.2.3 Asset Pricing with Cone Constraints 

Pham and Touzi (1999) address the problem of characterizing NA in
the presence of frictions in a discrete-time financial market model. They
extend the Fundamental Theorem of Asset Pricing to cone constraints on the
trading strategies under a nondegeneracy assumption. In the presence of
transaction costs and under a nondegeneracy condition on the risky asset price
process, they also prove that the NFL and the NA conditions are locally
equivalent, i.e. when trading is restricted to some period $[t-1, t]$. Their main result
states the equivalence of the no local arbitrage condition and the existence of an
equivalent probability measure satisfying a further generalization of the
martingale property. They do not provide a multiperiod version of this result. For
a more general setting of convex constraints, see Brannath (1997).
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2.2.4 Nonlinear Asset Pricing 

On financial markets without frictions, no-arbitrage pricing makes it possible
to price non-marketed redundant assets using the equilibrium prices of the
marketed assets. Assets are then valued by a linear function of their payoffs
(mathematical expectation). The equilibrium prices of the marketed assets
determine a set of risk neutral probability distributions such that the
equilibrium price of a redundant asset equals the mathematical expectation of its
discounted payoff with respect to these probability distributions. This pricing
rule is consistent with
equilibrium in the sense that, introducing a redundant asset at its no-arbitrage 
price does not affect the equilibrium allocation; see, for example, Harrison and 
Kreps (1979). In markets with frictions, pricing rules may be non-linear. Two 
portfolios yielding the same payoffs need not have the same formation cost 
(net of transaction cost), but the difference may not imply the existence of a 
free lunch because of frictions. Consider for example bid-ask spreads or trans- 
action costs. Then clearly prices (as a function of asset payoffs) are non-linear, 
since the price an agent has to pay for buying an asset is strictly larger than the 
price an agent receives for selling it. Therefore equilibrium asset prices cannot 
be represented by the mathematical expectation of their discounted payoff with 
respect to a probability measure. 

Asset valuation by a Choquet integral is introduced in Chateauneuf et al. 
(1996). They introduce a nonlinear valuation formula similar to the usual ex- 
pectation with respect to the risk-adjusted probability measure. This formula 
expresses the asset's selling and buying prices set by dealers as the Choquet 
integrals of their random payoffs. In this paper bid-ask spreads are
considered. Bid-ask spreads are one of many types of friction prevailing in
financial markets, and they differ from the traditional formalization of
proportional transaction costs.

Let us consider the following situation pointed out by Chateauneuf et al. (1996):
assume that a dealer sells an asset $Y$ (defined by its flow of payoffs) at a price
$q(Y)$ and that she buys it at a price $-q(-Y)$, such that she makes the positive
profit $q(Y) + q(-Y) > 0$. Then, because $0 = q(0) = q(Y - Y)$, $q$ cannot
be linear; hence it cannot be calculated as $q(Y) = \int_S Y \, d\mu$, where $S$ is the
set of random states and $\mu$ is some risk-adjusted probability over $S$. In this
setting, the paper imposes certain axioms on prices (generalizing the usual
no-arbitrage conditions) and deduces from them a result on the structure of prices
(representation as a Choquet integral: an expectation with respect to a concave
capacity). Capacities were introduced by Schmeidler (1989) in individual
decision theory. Formally, a capacity on a measurable space $(S, F_S)$ is a set
function $v : F_S \to [0, 1]$ satisfying $v(S) = 1$, $v(\emptyset) = 0$. Furthermore $v$ is
said to be convex or supermodular (resp. concave) if






$v(A \cup B) + v(A \cap B) \ge$ (resp. $\le$) $v(A) + v(B)$, for all $A, B \in F_S$.

In this context, a convex capacity is interpreted as a representation of risk
(uncertainty) aversion. This characterization of uncertainty aversion has been
used in single-agent models for which convex capacities are representations
of individual behaviors. In contrast, Chateauneuf et al. (1996) use a model in
which agents are price takers and the concave capacity is derived from prices.

Formally, the uncertainty in the model they consider is described by the
measurable state space $(S, F_S)$, where $F_S$ is a given $\sigma$-algebra of events of $S$. An
asset is defined by the random variable $X$ of its payoffs. Bounded assets are
considered. These assets are sold to and bought from agents by a dealer. Hence,
all traded assets have an ask and a bid price fixed by the dealer. These prices
are described by $q(Y)$ and $-q(-Y)$ respectively, i.e., the prices at which the
dealer sells asset $Y$ to agents and buys asset $Y$ from agents. Three axioms on
prices, which generalize the usual NA conditions to markets with a dealer, are
then imposed. The first is the usual NFL. The second one, as is usually done
in pricing models, assumes no transaction costs on riskless assets. The third
axiom replaces the (usually implicit) tight markets condition. Traditionally,
two portfolios yielding the same payoffs must have the same price, implying
that the price functional is linear. Taking into account the potential reduction
of risk when the portfolio $X + Y$ is sold instead of $X$ or $Y$ alone induces the
dealer to sell $X + Y$ at a discount relative to $X$ and $Y$ sold separately.

A typical example where hedging effects occur and $X$ and $Y$ are not
comonotone (comonotonicity: for all $s, s' \in S$, $[X(s) - X(s')][Y(s) - Y(s')] \ge 0$)
is the following one from Chateauneuf et al. (1996). Suppose that $X$ offers
1000 if event $B$ occurs, 5000 otherwise, and $Y$ offers 5000 if $B$ occurs, 1000
otherwise. Clearly $X$ and $Y$ are not comonotone, and $X$ (resp. $Y$) is a hedge
against $Y$ (resp. $X$) since $X + Y$ is riskless: it offers 6000 with certainty.
So, subadditivity for $q$, $q(X + Y) \le q(X) + q(Y)$, is required. Notice that,
consequently, no discount will be offered by the dealer when $X$ and $Y$ are
comonotone; i.e., $q(X + Y) = q(X) + q(Y)$ if $X$ and $Y$ are comonotone.
The third axiom (Comonotonicity Premium) then states that, for all $X, Y \in A$,
$q(X + Y) \le q(X) + q(Y)$, with equality if $X$ and $Y$ are comonotone. Their
main result is the so-called Choquet Sublinear Pricing Theorem. Under the
three axioms above, this theorem asserts that there exists a unique concave
capacity $v$ on the set of states $S$ such that the value of an asset $X$ is defined by
$q(X) = \max \{ \int_S X \, d\mu ;\ \mu \text{ is an additive probability s.t. } \mu \le v \}$. The price of
$X$ is the Choquet integral of its payoffs, $q(X) = \int_S X \, dv$, where

$$\int_S X \, dv = \int_{\mathbb{R}^-} \big( v(X \ge t) - 1 \big) \, dt + \int_{\mathbb{R}^+} v(X \ge t) \, dt,$$

and $q$ is sublinear (i.e. subadditive and positively homogeneous, and hence
convex).
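
The hedging example above can be reproduced numerically on a two-state space. The following is a minimal sketch, not the construction of Chateauneuf et al. (1996): the equal state probabilities and the square-root distortion used to build a concave capacity are assumptions made only for illustration. On a finite state space the Choquet integral reduces to a sum over states ranked by decreasing payoff.

```python
import math


def choquet_integral(payoff, capacity):
    """Choquet integral of a finite payoff (dict state -> value) with respect
    to a capacity (function frozenset of states -> [0, 1])."""
    ranked = sorted(payoff, key=payoff.get, reverse=True)
    total, previous, upper_set = 0.0, 0.0, set()
    for state in ranked:
        upper_set.add(state)               # states paying at least payoff[state]
        current = capacity(frozenset(upper_set))
        total += payoff[state] * (current - previous)
        previous = current
    return total


# Assumed reference probabilities and concave distortion g(p) = sqrt(p);
# v = g(P) is then a concave capacity with v(empty) = 0 and v(S) = 1.
prob = {"B": 0.5, "not_B": 0.5}


def capacity(event):
    return math.sqrt(sum(prob[s] for s in event))


X = {"B": 1000.0, "not_B": 5000.0}
Y = {"B": 5000.0, "not_B": 1000.0}
Z = {"B": 0.0, "not_B": 100.0}             # comonotone with X


def add(a, b):
    return {s: a[s] + b[s] for s in a}


q = lambda payoff: choquet_integral(payoff, capacity)
print(q(add(X, Y)), q(X) + q(Y))           # hedge: strict subadditivity
print(q(add(X, Z)), q(X) + q(Z))           # comonotone: additivity
print(-q({s: -v for s, v in Y.items()}), q(Y))   # bid price below ask price
```

In this sketch the riskless portfolio $X + Y$ is priced at 6000, below the sum of the two separate ask prices, while the comonotone pair is priced additively.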

Applications to pricing "primes" and "scores" are given in the paper of
Chateauneuf et al. (1996).

In these settings De Waegenaere et al. (1996) propose a pricing rule for the 
valuation of assets on financial markets with intermediaries. They assume that 
the non-linearity arises from the fact that dealers charge a price for their inter- 
mediation between buyer and seller. The price of an asset equals the signed 
Choquet integral of its discounted payoff with respect to a concave signed ca- 
pacity. Furthermore, they show that this pricing rule is consistent with equilib- 
rium and equilibria satisfy a notion of constrained Pareto optimality. 

On the other hand, a universal framework for pricing financial and insurance
risks has been introduced recently by Wang (2000), who proposes a pricing
method based on the transformation $F^*(x) = \Phi[\Phi^{-1}(F(x)) + \lambda]$,
where $\Phi$ is the standard normal cumulative distribution function. The key
parameter $\lambda$ is called the market price of risk, reflecting the level of
systematic risk. For a given asset $X$ with $F(x) = \Pr\{X \le x\}$, the Wang transform
produces a "risk-adjusted" cumulative probability distribution $F^*(x)$. The mean
value under $F^*(x)$ defines a risk-adjusted "fair value" of $X$ at time $T$, which
can be further discounted to time zero using the risk-free interest rate. This
approach is partly inspired by the work of Venter (1991) and Butsic (1999).
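
A minimal sketch of the Wang transform for a discrete payoff distribution follows. The three-point distribution, the value of the market price of risk and the helper names are assumptions for illustration only; discounting to time zero is omitted. The sketch uses `statistics.NormalDist` from the Python standard library for the standard normal distribution function and its inverse.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and standard deviation 1


def wang_transform(cdf_value, lam):
    """Risk-adjusted CDF value Phi(Phi^-1(F) + lam); 0 and 1 are kept fixed."""
    if cdf_value <= 0.0:
        return 0.0
    if cdf_value >= 1.0:
        return 1.0
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(cdf_value) + lam)


def risk_adjusted_mean(outcomes, probs, lam):
    """Mean of a discrete payoff under the Wang-transformed distribution."""
    pairs = sorted(zip(outcomes, probs))
    mean, cumulative, previous_star = 0.0, 0.0, 0.0
    for k, (x, p) in enumerate(pairs):
        # Force the last cumulative probability to 1 to avoid rounding drift.
        cumulative = 1.0 if k == len(pairs) - 1 else cumulative + p
        star = wang_transform(cumulative, lam)
        mean += x * (star - previous_star)   # risk-adjusted probability of x
        previous_star = star
    return mean


outcomes = [80.0, 100.0, 120.0]   # hypothetical payoffs at time T
probs = [0.25, 0.50, 0.25]
print(risk_adjusted_mean(outcomes, probs, 0.0))   # lam = 0: no adjustment
print(risk_adjusted_mean(outcomes, probs, 0.3))   # lam > 0: lower fair value
```

With a positive $\lambda$ the transformed distribution puts more weight on low payoffs, so the risk-adjusted mean of an asset falls below its expected payoff.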

2.3 Time series models 

In this section, we review the literature on the time series models usually
fitted to financial data. As this is a very broad area, the focus is only on the
main branches of the literature, with special attention to the most recent
developments. Campbell et al. (1997) and Tsay (2002) present excellent textbook
reviews of Financial Econometrics, and Bollerslev (2001) and Engle (2001,
2002a) have very interesting discussions on past developments and future
perspectives in this area.

Traditionally, the two main motivations to use time series models to ana- 
lyze financial data are to represent the empirical properties often observed in 
real prices and to estimate and test the financial models described in section 2. 
In this section, we describe models proposed mainly to represent the empiri- 
cal properties of financial prices while section 4 is devoted to the relationship 
between time series models and Finance theory. 

The empirical properties of financial prices depend crucially on the fre- 
quency of observation. We consider three main classes of frequencies. First, 
it is possible to observe prices at very high frequencies as, for example, tick 
by tick or hourly prices. These observations are called Ultra-high-frequency 
(UHF) data by Engle (2000) and they are usually characterized by unequally 
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spaced and discrete-value observations. Another important property is the 
presence of strong daily patterns with highest volatility at the open and toward 
the close of the day. On top of this intraday volatility pattern, UHF returns are 
characterized by highly persistent conditionally heteroscedastic components 
along with discrete information arrival effects; see Andersen and Bollerslev 
(1997a, 1997b, 1998), Müller et al. (1997) and Andersen et al. (2001).
Finally, it is possible to have multiple transactions within a single second.

Prices can also be observed at high frequencies, for example daily or
weekly. This frequency is the most extensively analyzed in the empirical
literature. A vast number of papers show that high frequency returns
are nearly uncorrelated, although they are not independent, because there are
non-linear transformations, such as squares or absolute values, that have
significant autocorrelations. Furthermore, these autocorrelations are usually
small and decay very slowly towards zero. The significant autocorrelations of
squared returns are often related to the presence of volatility clustering, i.e.
periods of low volatility are usually followed by periods of low volatility, and
periods of high volatility by periods of high volatility. Furthermore, the slow
decay is usually interpreted as the presence of long memory in the volatility;
see Lobato and Savin (1998) and Granger et al. (2000) and the references
therein. On the other hand, high frequency returns are often leptokurtic and,
consequently, non-Gaussian. The heavy tails of returns can also be related to
the dynamic evolution of volatility.

Finally, prices are sometimes observed at very low frequencies, for example
monthly. Tsay (2002) shows that monthly returns still have excess kurtosis,
although smaller than at higher frequencies. On the other hand, monthly returns
seem to have more serial correlation than daily returns. Given that low
frequencies are not in general of interest for asset pricing models, the focus
in this section is on UHF and high frequency observations.

The rest of the section is organized as follows. Subsections 3.1 to 3.3 deal 
with models for high frequency observations. In subsections 3.1 and 3.2, we 
describe the models usually fitted to represent expected returns and volatilities 
respectively. In subsection 3.3, we consider multivariate models for systems of 
returns. Finally, in subsection 3.4, we describe models for UHF data. 

2.3.1 Models for the conditional mean 

One of the central questions in the Financial Econometrics literature is whe- 
ther financial prices are predictable and this is still a topic of controversy; see, 
for example, the special issue of the Journal of Empirical Finance, 8 (2001). 
In this section we describe univariate models and, consequently, the problem 
is whether future prices can be predicted with information contained in their 
own past. The main hypotheses that have often been tested are the martingale
and the random walk hypotheses. The martingale hypothesis can be expressed
as follows:

$$E[P_t \mid P_{t-1}, P_{t-2}, \ldots] = P_{t-1} \tag{3.1}$$

Therefore, given the prices up to time $t-1$, the price at time $t$ is expected
to be equal to the price at time $t-1$. The martingale hypothesis places a
restriction on expected returns but does not take risk into account. However,
as said in section 2, once asset returns are properly adjusted for risk, the
martingale hypothesis holds for rationally determined asset prices; see Harrison
and Kreps (1979). The risk-adjusted martingale property is known to be the
basis for pricing many financial derivatives such as options and swaps; see, for
example, Merton (1990) and Campbell et al. (1997).

The second hypothesis often tested in the financial literature is whether 
prices are generated by a random walk plus drift model given by: 

$$P_t = \mu + P_{t-1} + \varepsilon_t \tag{3.2}$$

where $\varepsilon_t$ is an independent process with zero mean and variance $\sigma^2$, and $\mu$ is the
expected price change. In model (3.2), if the distribution of the errors $\varepsilon_t$ is, for
example, Gaussian, there is a positive probability that prices become negative,
violating limited liability. Therefore, it is usual to assume the random walk
model not for prices but for logarithmic prices, i.e.

$$\log(P_t) = \mu + \log(P_{t-1}) + \varepsilon_t \tag{3.3}$$

In model (3.3), any arbitrary transformation of prices is unforecastable using
any arbitrary transformation of past prices. However, it is usual to assume
that the errors $\varepsilon_t$ are merely uncorrelated instead of independent, allowing, for
example, for the presence of conditional heteroscedasticity. As we have
mentioned before, this is a property often observed in high frequency returns.
Consequently, we will focus on tests of the random walk hypothesis where $\varepsilon_t$ is
uncorrelated.

When testing the null hypothesis that the autocorrelation coefficients of
returns, $r_t = \Delta \log(P_t)$, are all zero, it is important to take into account that
$\varepsilon_t$ is not independent because, usually, $\varepsilon_t^2$ is autocorrelated. Therefore, the
traditional tests for uncorrelatedness should be adequately modified; see Romano
and Thombs (1996) and Lobato et al. (2001), among others.

Alternatively, the random walk hypothesis can be tested using the Variance
Ratio (VR) statistic. This test is based on the property that the variance of
random walk increments is a linear function of the time interval; see Campbell et
al. (1997) for a detailed description of the VR test.
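
The idea behind the VR statistic can be sketched as follows. This is a simplified version that uses overlapping q-period returns and omits the bias corrections and the heteroscedasticity-robust standard errors discussed in Campbell et al. (1997); the function name and the simulated data are assumptions for illustration. Under the random walk hypothesis the ratio should be close to one.

```python
import random


def variance_ratio(log_prices, q):
    """Variance of q-period returns divided by q times the variance of
    one-period returns, both centred with the same estimated drift."""
    n = len(log_prices) - 1
    drift = (log_prices[-1] - log_prices[0]) / n
    one_period = [log_prices[t + 1] - log_prices[t] for t in range(n)]
    var_1 = sum((r - drift) ** 2 for r in one_period) / n
    q_period = [log_prices[t + q] - log_prices[t] for t in range(n - q + 1)]
    var_q = sum((r - q * drift) ** 2 for r in q_period) / len(q_period)
    return var_q / (q * var_1)


# Simulated log random walk with drift (illustrative parameters).
rng = random.Random(1)
log_prices = [0.0]
for _ in range(5000):
    log_prices.append(log_prices[-1] + 0.0002 + 0.01 * rng.gauss(0.0, 1.0))

print(variance_ratio(log_prices, 5))     # near 1 for a random walk

# A series whose returns alternate +1, -1 is strongly mean reverting:
zigzag = [0.0, 1.0] * 50 + [0.0]
print(variance_ratio(zigzag, 2))         # two-period returns are all zero
```

Ratios clearly below one point to negative autocorrelation in returns, and ratios above one to positive autocorrelation.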

The application of the previous tests to financial prices seems to suggest
that financial asset returns are predictable; see the special issue of the
Journal of Empirical Finance, 8 (2001) and the references therein. There are
several alternative explanations for this predictability. For example, Campbell
et al. (1997) and Lo and MacKinlay (1990) show that nonsynchronous trading
can introduce negative autocorrelations in returns. The bid-ask spread can also
introduce negative autocorrelations in asset returns; see, among others,
Campbell et al. (1997). Other possible explanations are time-varying risk
premiums, as in Harvey (2001) and Bekaert et al. (2001); irrational behavior of
market participants, as in Hong and Stein (1999), Benartzi and Thaler (1995),
Barberis et al. (2001) and Epstein and Zin (2001); market frictions such as
transaction costs or agency problems; or a fluke due to statistical inference.

2.3.2 Models for the conditional variance 

Although it is generally accepted that asset returns appear to be close to
a martingale difference process, there is overwhelming evidence that they
are not independent, due to autocorrelated squares. Assuming that returns have
zero mean and are serially uncorrelated, they can be represented by the
following model:

$$r_t = \sigma_t \varepsilon_t \tag{3.4}$$

where $\varepsilon_t$ is an independent and identically distributed (i.i.d.) process with zero
mean and unit variance, independent of the volatility $\sigma_t$. There are two main
proposals in the literature to represent the dynamic evolution of $\sigma_t$:
Generalized Autoregressive Conditional Heteroscedasticity (GARCH) and Stochastic
Volatility (SV) models.

GARCH models, originally proposed by Engle (1982) and Bollerslev (1986),
are based on modelling the volatility as the variance of returns conditional on
past observations. There is a plethora of papers where GARCH models are
investigated from a theoretical point of view or are applied to the empirical
analysis of financial time series. The main properties of GARCH models have
been reviewed, among others, by Bollerslev et al. (1995) and Carnero et al.
(2001a). Although the original motivation of GARCH models was mainly
empirical, Nelson (1992) shows that, even when misspecified, ARCH models
may serve as consistent filters for the continuous-time stochastic volatility
diffusions often employed in the asset pricing literature. Furthermore, Nelson
(1990, 1994) and Nelson and Foster (1994) provide some important links
between GARCH and the corresponding continuous-time models.

The original GARCH model has been extended in a huge number of direc- 
tions. Two of the main extensions from the empirical point of view, are models 
to represent the asymmetric response of volatility to positive and negative re- 
turns and to represent the effect of the volatility on the return of a stock. The 
first effect is known as leverage effect and was introduced by Black (1986). 




The first model proposed to represent the leverage effect was the Exponential
GARCH (EGARCH) model of Nelson (1991). Later, Hentschel (1995), Duan
(1997) and He and Terasvirta (1999) proposed models general enough
to unify many of the main previous ARCH-type models. With respect to the
effect of volatility on the expected return, Engle et al. (1987) introduced the
GARCH in mean (GARCH-M) model given by

$$\begin{aligned} r_t &= \mu + c\,\sigma_t^2 + a_t \\ a_t &= \sigma_t \varepsilon_t \\ \sigma_t^2 &= \omega + \alpha a_{t-1}^2 + \beta \sigma_{t-1}^2 \end{aligned} \tag{3.5}$$

The parameter $c$ is known as the risk premium parameter. Returns generated
by the GARCH-M model are autocorrelated because of the autocorrelations of
the volatility, $\sigma_t^2$.
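
The GARCH-M recursion in (3.5) is easy to simulate. The sketch below is only an illustration: the parameter values, the Gaussian innovations, the initialization at the unconditional variance and the function names are assumptions, not estimates from any data set. It also computes first-order sample autocorrelations to show the persistence of the conditional variance and the clustering in squared returns.

```python
import math
import random


def simulate_garch_m(n, mu, c, omega, alpha, beta, seed=0):
    """Simulate n returns from the GARCH-M model (3.5) with Gaussian noise.
    Requires alpha + beta < 1 so that the unconditional variance exists."""
    rng = random.Random(seed)
    sigma2 = omega / (1.0 - alpha - beta)      # start at unconditional variance
    returns, variances = [], []
    for _ in range(n):
        a = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
        returns.append(mu + c * sigma2 + a)
        variances.append(sigma2)
        sigma2 = omega + alpha * a * a + beta * sigma2
    return returns, variances


def autocorrelation(x, lag):
    """Sample autocorrelation of the series x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    numer = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return numer / denom


# Illustrative parameter values.
returns, variances = simulate_garch_m(
    20000, mu=0.0, c=0.05, omega=0.05, alpha=0.15, beta=0.80)

acf_variance = autocorrelation(variances, 1)
acf_squares = autocorrelation([r * r for r in returns], 1)
acf_returns = autocorrelation(returns, 1)
print(acf_variance, acf_squares, acf_returns)
```

For a small risk premium parameter the autocorrelation induced in the returns themselves is weak, while the squared returns inherit much more of the persistence of the conditional variance.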

There are many other generalizations of the original GARCH model. For 
example, Zakoian (1994) allows for regime switching, where volatility
persistence can take different values depending on whether returns are in a high
or a low volatility regime. To represent the long memory property of squared 
returns, Baillie et al. (1996) introduce the Fractionally Integrated GARCH 
(FIGARCH) model. Although the FIGARCH model has been fitted in several 
empirical applications, it is not stationary in covariance and, consequently, the 
properties of the corresponding estimators and tests are generally unknown. 
Finally, Engle and Lee (1999) have proposed a GARCH model with two com- 
ponents in volatility: one which is nearly nonstationary and another that is 
much less persistent. 

All GARCH models have the attractive feature that they can be easily estimated
by Maximum Likelihood techniques. However, Terasvirta (1996) and Carnero et al.
(2001b) show that the basic GARCH(1,1) model is not flexible enough to represent
adequately the properties often observed in real time series of returns.
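
To make the Maximum Likelihood remark concrete, here is a minimal sketch of the Gaussian negative log-likelihood of a GARCH(1,1) model for a demeaned series. The function name, the initialisation at the sample variance and the handling of inadmissible parameters are assumptions for illustration; a numerical optimiser of the reader's choice would minimise this function.

```python
import math


def garch11_neg_loglik(params, a):
    """Gaussian negative log-likelihood of GARCH(1,1) for demeaned returns a.

    params = (omega, alpha, beta). Inadmissible values return +inf so that
    an optimiser stays inside the stationarity region (an assumed convention).
    """
    omega, alpha, beta = params
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        return float("inf")
    # Assumed initial variance: the sample second moment of the series.
    sigma2 = sum(x * x for x in a) / len(a)
    nll = 0.0
    for x in a:
        nll += 0.5 * (math.log(2.0 * math.pi) + math.log(sigma2) + x * x / sigma2)
        # Update the conditional variance for the next observation.
        sigma2 = omega + alpha * x * x + beta * sigma2
    return nll
```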

Alternatively, the volatility, $\sigma_t^2$, can be modelled using SV models that
introduce an additional noise in its equation. Therefore, the volatility is a latent
variable composed of a predictable component, that depends on past returns,
plus an unexpected component. SV models were originally proposed by Taylor
(1986) and their properties have been reviewed by Taylor (1994), Ghysels et
al. (1996) and Shephard (1996). The introduction of the unobserved component
in the representation of the volatility gives more flexibility to SV models
to represent the empirical properties often observed in real time series of
returns; see Carnero et al. (2001b). However, the estimation of these models
presents some added difficulties over the estimation of GARCH models. The
likelihood function does not have a closed form and, consequently, most estimation
methods proposed in the literature are based on numerical approximations of
the likelihood or on transformations of the observations. Although there is
still no consensus about which methods are the most adequate to estimate
SV models, recently there has been important progress towards methods that
are computationally feasible and, at the same time, have properties similar to
those of the Maximum Likelihood estimators; see Broto and Ruiz (2002) for a detailed
description of estimation methods for SV models.

Recently, Chib et al. (2002) have proposed the following SV model where 
returns can contain a jump component to allow for large, transient movements,



$$r_t = x_t \beta + k_t q_t + w_t^{\gamma} \sigma_t \varepsilon_t \qquad (3.6)$$

$$\log \sigma_t^2 = \mu + z_t' \alpha + \phi(\log \sigma_{t-1}^2 - \mu) + \eta_t$$

where $x_t$, $w_t$ and $z_t$ are covariates and $\gamma$ denotes the level effect. The covariate
$w_t$ is a non-negative process as, for example, lagged interest rates; see Andersen
and Lund (1997). The noises $\varepsilon_t$ and $\eta_t$ are mutually independent Student-t
and Gaussian white noise processes respectively, both with zero mean and
variances one and $\sigma_\eta^2$. Finally, with respect to the jump component, $q_t$ is a
Bernoulli random variable that takes value one with probability $\kappa$ and $k_t$ is
the size of the jump, distributed as $\log(1 + k_t) \sim N(-0.5\delta^2, \delta^2)$. They argue
that model (3.6) without the jump component can be thought of as an Euler
discretization of a Student-t Lévy process with additional stochastic volatility
effects. This process has been used in the continuous time options and
risk assessment literature; see, for example, Barndorff-Nielsen and Shephard
(2002b), Eberlein (2002) and Eberlein and Prause (2002). On the other hand,
models with jumps have also been frequently applied in continuous time models
of financial asset pricing; see, for example, Merton (1976), Ball and Torous
(1985), Bates (1996), Duffie et al. (2000) and Barndorff-Nielsen and Shephard
(2001). From the point of view of the Financial Econometrics literature, SV
models with jumps have been previously considered by Chernov et al. (2000),
Barndorff-Nielsen and Shephard (2002a) and Eraker et al. (2003).
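
The structure of (3.6) can be sketched with a small simulation. This is a simplified version under stated assumptions, not the specification estimated by Chib et al. (2002): the covariates are dropped (no regression term, $w_t = 1$, no $z_t$), the function name and the default degrees of freedom `nu` are hypothetical, and the Student-t noise is rescaled to unit variance.

```python
import math
import random


def simulate_sv_jumps(n, mu, phi, sigma_eta, kappa, delta, nu=8, seed=0):
    """Simulate a covariate-free SV model with jumps (simplified sketch).

    r_t = k_t * q_t + sigma_t * eps_t
    log sigma_t^2 = mu + phi * (log sigma_{t-1}^2 - mu) + eta_t
    Requires nu > 2 so that the Student-t variance exists.
    """
    rng = random.Random(seed)
    h = mu  # assumed start: log-variance at its mean level
    out = []
    for _ in range(n):
        h = mu + phi * (h - mu) + sigma_eta * rng.gauss(0.0, 1.0)
        # Student-t draw: normal / sqrt(chi2/nu), rescaled to unit variance.
        chi2 = rng.gammavariate(nu / 2.0, 2.0)
        eps = rng.gauss(0.0, 1.0) / math.sqrt(chi2 / nu)
        eps *= math.sqrt((nu - 2.0) / nu)
        # Jump indicator q_t ~ Bernoulli(kappa); log(1 + k_t) ~ N(-delta^2/2, delta^2).
        q = 1 if rng.random() < kappa else 0
        k = math.exp(rng.gauss(-0.5 * delta ** 2, delta)) - 1.0
        out.append(k * q + math.exp(0.5 * h) * eps)
    return out
```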

As in the case of GARCH models, SV models have also been extended 
to represent the asymmetric response of volatility to negative and positive re- 
turns and the response of expected returns to volatility by Harvey and Shephard 
(1996) and Koopman and Uspensky (2002) respectively. Another extension of 
SV models considered in the literature is to allow for long memory in volatility;
see Harvey (1998) and Breidt et al. (1998). 

2.3.3 Models for conditional covariances 

Multivariate models have often been used to represent financial series of
returns related, for example, to the Asset Pricing Theory (APT), asset
allocation, estimation of time-varying betas or Value at Risk (VaR). However,
although numerous multivariate models for returns have been proposed, there
is not yet a consensus about which models are better, mainly due to a dimensionality
problem. The literature on multivariate GARCH models is often concerned
with the lack of parsimony of these models and the constraints needed to guarantee
that the conditional covariance matrix, $\Sigma_t$, is positive definite; see Engle
(2002a,b), who reviews the most popular multivariate models proposed in the
context of GARCH. The dimensionality very quickly becomes a problem because
the conditional covariance matrix of a $k$-dimensional return series has
$k(k+1)/2$ distinct quantities. To keep the number of parameters low, Bollerslev
(1990) considers a multivariate GARCH model with constant correlations
that always satisfies the positive-definite condition of $\Sigma_t$. The constant
correlation hypothesis can be tested using the Lagrange multiplier test proposed
by Tse (2000). Because of its computational simplicity, the constant correlation
model of Bollerslev (1990) has been widely used in the empirical analysis
of financial data. However, if the correlations evolve over time, this model is
inadequate and can give incorrect inferences. Very recently, there have been
different proposals of multivariate GARCH models with time-varying conditional
correlations. For example, Tsay (2002) proposes two alternative ways of
dealing with the conditional covariance matrix. The first one consists of modeling
directly the evolution of the correlations and the second is based on the
Cholesky decomposition of $\Sigma_t$. The attraction of the second alternative is that
it does not require any constraint to ensure the positive definiteness of $\Sigma_t$.
Alternatively, Tse and Tsui (2002) propose a multivariate GARCH (MGARCH)
model with time-varying correlations where the constraints required to ensure
a positive definite covariance matrix can be imposed during the optimization
procedure. Finally, Engle (2002b) proposes a nonlinear Dynamic Conditional
Correlation (DCC) model that can be estimated in two steps from univariate
GARCH models. Alternatively, Ledoit et al. (2003a) also propose a two-step
estimation procedure for the original unrestricted diagonal-Vech multivariate
GARCH(1,1) model of Bollerslev et al. (1988) given by

$$\mathrm{Cov}(r_{it}, r_{jt} \mid \Omega_{t-1}) = h_{ij,t} = c_{ij} + a_{ij} r_{i,t-1} r_{j,t-1} + b_{ij} h_{ij,t-1} \qquad (3.7)$$

In the first step, the parameters are estimated separately by estimating the
two-dimensional or one-dimensional equations in (3.7). Then, the estimated
matrices are transformed to guarantee positive semi-definiteness.

An extensive and detailed comparison between the alternative models to 
represent time-varying correlations is still to be done. 
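
The dimensionality count and the appeal of a Cholesky parametrization can be illustrated with a short sketch. It is only an illustration of the general idea, not the specification of Tsay (2002) nor the transformation of Ledoit et al. (2003a): the function names and the use of an exponential on the diagonal are assumptions made here.

```python
import math


def n_distinct(k):
    """Number of distinct elements of a k x k covariance matrix."""
    return k * (k + 1) // 2


def covariance_from_cholesky(params, k):
    """Map k(k+1)/2 unconstrained numbers to a positive definite matrix.

    The numbers fill a lower-triangular factor L row by row; diagonal
    entries pass through exp() so that they are strictly positive, and
    Sigma = L L' is then positive definite without further constraints.
    """
    if len(params) != n_distinct(k):
        raise ValueError("expected k(k+1)/2 parameters")
    factor = [[0.0] * k for _ in range(k)]
    values = iter(params)
    for i in range(k):
        for j in range(i + 1):
            v = next(values)
            factor[i][j] = math.exp(v) if i == j else v
    # Sigma[i][j] = sum_m L[i][m] * L[j][m]
    return [
        [sum(factor[i][m] * factor[j][m] for m in range(k)) for j in range(k)]
        for i in range(k)
    ]
```

For example, `n_distinct(10)` gives 55 distinct quantities for only ten assets, which is why parsimony dominates this literature.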

Another completely different approach to simplify the dynamic structure of 
a multivariate volatility process is to use factor models. Multivariate factor 
models provide a way of dealing with the APT; see, for example, Campbell 
et al. (1997) for a very simple exposition. Denoting by $r_t$ the $N \times 1$ vector of
returns at time $t$, it is given by

$$r_t = \alpha + B f_t + \varepsilon_t \qquad (3.8)$$

$$(\varepsilon_t', f_t')' \sim NID(0, D)$$

where $D$ is a diagonal matrix, $B$ is the matrix of factor loadings and $f_t$ is
a $K$-dimensional vector of factors. The APT says that, as the dimension of
$r_t$ increases (approximating the market), then $\alpha \approx \iota r + B\lambda$, where $r$ is the
riskless interest rate, $\iota$ is a vector of ones and $\lambda$ is a vector representing the
factor risk premium associated with the factors often identified as the variances
of the factors. However, the normality assumption in (3.8) is usually inadequate
for high frequency series of returns. Consequently, this assumption has been
relaxed in the subsequent literature. Diebold and Nerlove (1989) and King et
al. (1994) analyze factor models where the factors and idiosyncratic errors
follow their own ARCH process. Sentana and Fiorentini (2001) show that the
identifiability restrictions for conditionally heteroscedastic factor models are
less severe than in static factor models.

In the context of SV models, the first multivariate model was originally 
proposed by Harvey et al. (1994) who allow the variances and covariances 
to evolve through time with possibly common trends. Later, Ray and Tsay 
(2000) used the same model to study common long memory components in 
daily stock volatilities of groups of companies. However, the multivariate SV 
model of Harvey et al. (1994) restricts the correlations to be constant over 
time. Later, Jacquier et al. (1995) propose a factor SV model given by 

$$r_t = B f_t + \varepsilon_t \qquad (3.9)$$

$$\varepsilon_t \sim NID(0, I)$$

$$f_{it} \sim SV(\phi_i; \sigma_i; 0), \quad i = 1, \dots, K$$

Kim et al. (1998) generalize model (3.9) by allowing the idiosyncratic
noises to follow independent univariate SV models. Then, Aguilar and West
(2000) and Pitt and Shephard (1999) implement the model using two alternative
Markov chain Monte Carlo (MCMC) techniques. Finally, Tsay (2002)
presents an MCMC estimation of the multivariate SV model based on the Cholesky
decomposition.

2.3.4 Models for intradaily data 

The analysis of UHF data is closely related to what is known as Market
Microstructure and is one of the most active research areas in Financial
Econometrics. However, traditional econometric tools may not be appropriate
as tick by tick observations are not equally spaced and are discrete valued. In
this case, it is possible to use marked point processes or continuous time methods
in which the sampling frequency is determined by some notion of time
deformation; see, for example, Andersen (1996). With respect to using UHF
data to estimate the volatility, Andersen and Bollerslev (1998) show that the
precision of volatility forecasts is improved if the data are sampled more frequently.
However, UHF data are affected by problems such as the bid-ask spread or
non-synchronous trading that, as previously mentioned, can generate autocorrelations
in returns. Andersen et al. (2001) develop new robust methods for
inference in the UHF data setting. Their approach is based on an extension of
the Fourier Flexible Form (FFF) regression framework.

Hausman et al. (1992) propose an ordered probit model to study price
movements in transactions data where the explanatory variables are the dura- 
tion between trades, the bid-ask spread, the lagged values of price change and 
volume, the return of the S&P500 index and an indicator variable that depends 
on the bid and ask prices. Alternatively, Rydberg and Shephard (2003) pro- 
pose to decompose the price change into three components: an indicator for 
the price change, the direction of the change and the size of the change. 

Finally, when analyzing UHF data, it is important to model not only the 
trades but also the timing between trades. In this sense, Engle and Russell 
(1998) propose the Autoregressive Conditional Duration (ACD) model that 
estimates the distribution of the time between events conditional on past in- 
formation. Later, Dufour and Engle (2000) show that the more frequent the 
transactions, the greater the volatility. Furthermore, they show that transaction
arrivals are predictable based on economic variables such as the bid-ask spread.
Zhang et al. (2001) extend the ACD model to account for nonlinearity and
structural breaks in the data. Finally, Tsay (2002) introduces the Price Change
and Duration (PCD) model to describe the multivariate dynamics of price
changes and associated durations.
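
The text does not write out the ACD equations, so the sketch below uses the commonly quoted ACD(1,1) form, $x_i = \psi_i \epsilon_i$ with $\psi_i = \omega + \alpha x_{i-1} + \beta \psi_{i-1}$, as an assumption. The function name, the unit-mean exponential innovations and the starting value are likewise illustrative choices.

```python
import random


def simulate_acd(n, omega, alpha, beta, seed=0):
    """Simulate durations from an ACD(1,1)-type recursion (illustrative).

    psi_i is the conditional expected duration given past durations;
    innovations are unit-mean exponential. Assumes alpha + beta < 1.
    """
    rng = random.Random(seed)
    psi = omega / (1.0 - alpha - beta)  # assumed start: unconditional mean
    x = psi
    durations = []
    for _ in range(n):
        # Conditional expected duration given the previous duration.
        psi = omega + alpha * x + beta * psi
        x = psi * rng.expovariate(1.0)
        durations.append(x)
    return durations
```

The recursion mirrors the GARCH variance equation, with durations between trades playing the role of squared returns.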

2.4 Applications of time series to financial models 

Summarizing the literature described in sections 2 and 3, it seems rather 
clear that there is a gap between the theoretical asset pricing and the Finan- 
cial Econometrics literature. First, although continuous time methods and 
no-arbitrage arguments are prominent in the asset pricing literature, most in- 
fluential contributions have been derived under very restrictive assumptions 
about the underlying process. For example, the Black-Scholes option valuation
formula assumes constant volatility although it is generally accepted empirically
that volatility evolves over time. However, recently, some authors have
proposed more realistic continuous time processes with time varying volatili- 
ties; see, for example, Hull and White (1987), Heston (1993), Duffie and Kan 
(1996) and Dai and Singleton (2000). Engle (2001) suggests that the use of 
UHF data potentially could provide information on the more appropriate class 
of diffusion models to use for pricing both underlying and derivative assets.
On the other hand, the Financial Econometrics literature has many challenges
to provide instruments adequate to represent the behavior of asset prices. The 
econometrics of, for example, jump diffusion or affine models are difficult. 
Bollerslev (2001) points out that recent research on the link between the prob- 
ability distributions of actual asset prices and the corresponding risk-neutral 
probability distributions implied by derivative prices has just started and that 
much research remains to be done. Some relevant references in this sense
are Ait-Sahalia and Lo (2000), Andersen et al. (2002), Chernov and Ghysels
(2000) and Duffie et al. (2000). Also very useful are the guest editorial by
Ghysels and Tauchen (2003) and all the papers within the special issue of the
Journal of Econometrics on the intersection between Financial Econometrics
and Financial Engineering.

2.4.1 Estimation of the CAPM 

Two classical pricing models arise in the financial literature. The Capital Asset
Pricing Model (CAPM) is a set of predictions concerning equilibrium expected
returns on assets; see, for example, Sharpe (1964) or Lintner (1965). The classic
CAPM assumes that all investors have the same one-period horizon and that asset
returns have multivariate normal distributions. For a fixed time horizon, let $R_i$
and $R_M$ be the returns of asset $i$ and of the market portfolio $M$, respectively.
The classic CAPM, sometimes called the Sharpe-Lintner CAPM, asserts that

$$E[R_i] = r + \beta_i \{E[R_M] - r\} \qquad (4.1)$$

where $r$ is the risk-free return and $\beta_i = \mathrm{Cov}(R_i, R_M)/\sigma_M^2$ is the beta of asset $i$,
with $\sigma_M^2$ the variance of $R_M$.

Assuming that asset returns are normally distributed and the time horizon is
one period (e.g., one year), a key concept in financial economics is the market
price of risk, given by $\lambda_i = (E[R_i] - r)/\sigma_i$, where $\sigma_i$ is the standard
deviation of $R_i$. In asset portfolio management, this is
also called the Sharpe Ratio, after William Sharpe.

In terms of market price of risk, CAPM can be restated as follows: 

$$\lambda_i = \frac{E[R_i] - r}{\sigma_i} = \frac{\mathrm{Cov}(R_i, R_M)}{\sigma_i \sigma_M} \cdot \frac{E[R_M] - r}{\sigma_M} = \rho_{i,M} \lambda_M, \qquad (4.2)$$

where $\rho_{i,M}$ is the linear correlation coefficient between $R_i$ and $R_M$. In other
words, the market price of risk for asset $i$ is directly proportional to the correlation
coefficient between asset $i$ and the market portfolio $M$.
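
The step from (4.1) to (4.2) can be checked numerically. The sketch below takes population moments as given inputs; the function name and the numerical values in the usage note are hypothetical and serve only to show that the market price of risk of asset $i$ equals the correlation times the market price of risk of the market portfolio.

```python
def capm_quantities(cov_im, sigma_i, sigma_m, mean_m, r):
    """Return beta, E[R_i] from (4.1), lambda_i and rho * lambda_M from (4.2)."""
    beta = cov_im / sigma_m ** 2                 # beta_i = Cov(R_i, R_M) / sigma_M^2
    expected_i = r + beta * (mean_m - r)         # Sharpe-Lintner CAPM, equation (4.1)
    lambda_i = (expected_i - r) / sigma_i        # market price of risk of asset i
    lambda_m = (mean_m - r) / sigma_m            # market price of risk of the market
    rho = cov_im / (sigma_i * sigma_m)           # linear correlation coefficient
    return beta, expected_i, lambda_i, rho * lambda_m
```

For instance, with `cov_im=0.02`, `sigma_i=0.2`, `sigma_m=0.25`, `mean_m=0.08` and `r=0.03`, the last two returned values coincide, as (4.2) requires.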

CAPM automatically prices assets in the set of all linear combinations of 
basic assets according to this linearity rule, as long as the market portfolio used 
in the CAPM is the mean-variance efficient portfolio of risky assets (alternatively
termed the Markowitz portfolio). CAPM provides a powerful insight regarding
the risk-return relationship, where only systematic risk deserves an extra risk
premium in an efficient market. However, CAPM and the concept of "market 
price of risk" were developed under the assumption of normal multivariate 
distributions for asset returns, and in practice the underwriting beta can be 
difficult to estimate. 

On the other hand, a common practice in pricing non-marketed assets is to
infer the price applying the CAPM formula to this asset as well, by simply 
entering the random payoff B corresponding to the non-marketed asset into 
the CAPM formula. Technically, the new price has a systematic relationship to 
the prices of the basic assets, more precisely, it is the price of the marketed asset 
that best approximates the random payoff B in the sense of minimum expected 
squared error. Following geometric and statistical considerations, Luenberger 
(2002a) proposes a correlation pricing formula similar to the CAPM formula, 
which expresses the price of a non-marketed asset in terms of a priced asset 
that is the most correlated with the non-marketed asset, rather than in terms 
of the market portfolio. The method has accuracy advantages when values in
the formula must be estimated. Beyond the NA principle, Luenberger (2002b) 
derives a pricing method for non-marketed assets determining the price such 
that an investor with a specific utility function will elect to include the new 
asset in his/her portfolio at the zero level. The idea of zero-level pricing of a 
non-marketed payoff is to find the price such that a certain investor will elect 
to neither purchase nor short it. At this price the investor is indifferent to the 
inclusion of the considered payoff. Conditions ensuring that such a price is
unique are given in Luenberger (2002b).

Besides CAPM, another major financial pricing paradigm is modern option 
pricing theory, first developed by Black and Scholes (1973). Unfortunately, 
the Black-Scholes formula only applies to lognormal distributions of market 
returns. Options pricing is performed in a world of Q-measure, where the avail- 
able data consists of observed market prices for related financial assets. On the 
other hand, actuarial pricing takes place in a world of P-measure, where the 
available data consists of projected losses, whose amounts and likelihood need 
to be converted to a "fair value" price; see Panjer (1998). Because of this dif- 
ference in types of data available, modern option pricing is mostly concerned 
with the minimal cost of setting up a hedging portfolio, whereas actuarial pric- 
ing is based on actuarial present value of costs, with additional adjustments for 
correlation risk, parameter uncertainty and cost of capital. In this setting, new
research directions are proposed in the recent literature.

The statistical framework for estimation and testing for the classical CAPM 
is the Maximum Likelihood (ML) approach; see Campbell et al. (1997), Gibbons
et al. (1989) and Bollerslev et al. (1988).

Inferences when there are deviations from the assumption that returns are 
jointly normal and iid through time have been developed. Tests which accommodate
non-normality, heteroscedasticity, and temporal dependence of returns
are of interest for two reasons. First, while the normality assumption is suf- 
ficient, it is not necessary to derive the CAPM as a theoretical model. Rather, 
the normality assumption is adopted for statistical purposes. Without this as- 
sumption, the finite sample properties of asset pricing model tests are difficult 
to derive. Second, departures of monthly security returns from normality have 
been documented. As we have pointed out in this review, there is also abun- 
dant evidence of heteroscedasticity and temporal dependence in stock returns. 
It is therefore of interest to consider the effects of relaxing these statistical
hypotheses. Robust tests of the CAPM can be constructed using a Generalized
Method of Moments (GMM). Within the GMM framework, the distribution 
of returns conditional on the market return can be both serially dependent and 
conditionally heteroscedastic. The only assumption is that excess asset returns 
are stationary and ergodic with finite fourth moments. GMM procedures to estimate
time-varying term premia and a consumption based asset pricing model
are used in Hansen and Singleton (1982) and Hansen and Scheinkman (1995).

Other lines of research are also of interest. One important topic is the ex- 
tension of the framework to test conditional versions of the CAPM, in which 
the model holds conditional on state variables that describe the state of the 
economy. Econometric methods from section 3 are suitable for testing the 
conditional CAPM. 

Another important subject is Bayesian analysis of mean-variance efficiency 
and the CAPM. Bayesian analysis allows the introduction of prior information. 
Harvey and Zhou (1990) and Kandel et al. (1995) are examples of work with 
this perspective. 

There is a controversy about the statistical evidence against the CAPM in 
the past 30 years. Some authors argue that the CAPM should be replaced 
by multifactor models with several sources of risk; others argue that the evi- 
dence against the CAPM is overstated because of mismeasurement of the mar- 
ket portfolio, improper neglect of conditional information, data snooping, or 
sample-selection bias; and yet others claim that no risk-based model can ex- 
plain the anomalies of stock-market behavior. Campbell et al. (1997) explore 
multifactor asset pricing models. 

2.4.2 Estimation of the term structure 

There is a vast literature devoted to the estimation of dynamic models of the 
term structure that describe the evolution of yields at all maturities. One of the 
main problems in this area is that the theoretical models need to be complex
enough to represent adequately the empirical complexity often observed.
However, as the complexity of the models increases, their estimation becomes 
more difficult.
Models of the term structure focus mainly on affine models, characterized
originally by Duffie and Kan (1996), that assume that the market price of risk is
a multiple of the interest rate volatility and that the state variables are independent.
Under these assumptions, ML estimation of the parameters is feasible.
However, many empirical studies have shown that this model has fundamental
limitations; see, for example, Ghysels and Ng (1998) and Dai and Singleton
(2000) among many others. To overcome these limitations, Dai and Singleton
(2000) propose the multivariate affine term structure models while Ahn et
al. (2002) propose the quadratic term structure models. However, neither of
these models is able to track adequately the dynamic evolution of volatility.
Recently, Ahn et al. (2003) investigate whether a hybrid model between
affine, quadratic and nonlinear models is able to outperform each of the individual
models. However, they conclude that, in general, this is not the case.
Dai and Singleton (2003) is an excellent review of models of the term structure
described from the point of view of their empirical implementation. They
focus on the fit of the theoretical specifications of dynamic structure models to
the historical shapes of the yield curves.

On the other hand, as we mentioned before, the estimation of these more 
complex models becomes difficult as the likelihood does not have, in general, 
a closed form. One of the most popular methods in this context is the Efficient
Method of Moments (EMM) of Gallant and Tauchen (1996). Duffee and Stan- 
ton (2003) estimate a multifactor term structure model with correlated factors, 
nonlinear dynamics and flexible price of interest rate risk, using both the EMM 
and an approximate Kalman filter. They conclude that the best results are ob- 
tained when the latter procedure is used to estimate the model although it is 
not asymptotically optimal. However, their results reveal severe biases in the 
parameter estimates regardless of the estimation method; see also Duan and 
Simonato (1999) and Chen and Scott (2002) for other authors that have also 
used the Kalman filter to estimate the term structure. 

2.4.3 Estimation of the VaR 

Regulators and risk managers are interested in obtaining measures of the
Value at Risk (VaR), defined as the expected loss of a portfolio after a given
period of time (usually 10 days) corresponding to the $\alpha$% quantile (usually
1%). This interest has motivated new methods designed to estimate the tails
of the distribution of returns. There are several methods to estimate the VaR.
The early parametric VaR models impose a known theoretical distribution on
price changes. Usually it is assumed that the density function of risk factors
influencing asset returns is a multivariate normal distribution as, for example,
in J.P. Morgan (1996). The most popular parametric methods are variance-covariance
models and Monte Carlo simulation. However, excess kurtosis of
these factors will cause losses greater than VaR to occur more frequently and
be more extreme than those predicted by the Gaussian distribution. Consequently,
several authors propose to use nonparametric (historical simulation)
and semiparametric models that avoid assuming a particular distribution of
price increments although they usually assume independent increments; see,
for example, Danielsson and de Vries (1998). Finally, some authors propose
to use extreme value theory estimation of tail shapes to estimate the VaR; see,
for example, Embrechts et al. (1997) and McNeil and Frey (2000). In relation
to these methods, Pearson and Smithson (2002) describe refinements which
increase computational speed and improve accuracy.
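
The contrast between the Gaussian parametric approach and historical simulation can be sketched for a single return series. The function names and the convention of reporting VaR as a positive loss at the $\alpha$ quantile are assumptions made here; this is not the RiskMetrics implementation, and the portfolio and horizon aspects are ignored.

```python
import math
import statistics


def gaussian_var(returns, alpha=0.01):
    """Parametric VaR: minus the alpha-quantile of a fitted normal law."""
    mean = statistics.fmean(returns)
    std = statistics.stdev(returns)
    z = statistics.NormalDist().inv_cdf(alpha)  # negative for alpha < 0.5
    return -(mean + z * std)


def historical_var(returns, alpha=0.01):
    """Nonparametric VaR: minus an empirical alpha-quantile of past returns."""
    ordered = sorted(returns)
    # Assumed quantile convention: the ceil(alpha * n)-th smallest return.
    index = max(0, math.ceil(alpha * len(ordered)) - 1)
    return -ordered[index]
```

With heavy-tailed returns the two numbers can differ noticeably at small $\alpha$, which is the practical content of the excess kurtosis remark above.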

However, as described in previous sections, financial returns are often char- 
acterized by volatility clustering and non-Gaussianity. Therefore, several au- 
thors have considered extensions of the previous approaches that allow for 
time-varying volatilities. The most popular approach is to estimate the VaR 
based on Conditional Gaussian GARCH models; see, for example, Christoffersen
and Diebold (2000) and Christoffersen et al. (2001). Guermat and
Harris (2002) further extend the GARCH approach to allow for kurtosis
clustering.

Recently, Engle and Manganelli (1999) have proposed a conditional quantile
estimation based on the CAViaR model given by

$$VaR_t = \beta_0 + \beta_1 VaR_{t-1} + \beta_2 |y_{t-1}| \qquad (4.3)$$
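
The recursion in (4.3) is simple to evaluate once the coefficients are given. The sketch below only filters a VaR path for fixed coefficients; the function name and the need to supply a starting value `var0` are assumptions, and the quantile-regression step that estimates the coefficients is not shown.

```python
def caviar_path(y, beta0, beta1, beta2, var0):
    """Filter VaR_t = beta0 + beta1 * VaR_{t-1} + beta2 * |y_{t-1}|.

    y is the observed return series and var0 an assumed starting value
    for the first period. Returns one VaR value per observation.
    """
    path = [var0]
    for t in range(1, len(y)):
        path.append(beta0 + beta1 * path[-1] + beta2 * abs(y[t - 1]))
    return path
```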

Gourieroux and Jasiak (2001) describe several alternative methods to estimate
the VaR, focusing on their main advantages and limitations. Tsay (2002)
also describes several of these methods and compares their performance in
estimating the VaR of daily returns of IBM stocks. In particular, he compares the
RiskMetrics methodology developed by J.P. Morgan, GARCH models, nonparametric
estimation, quantile regression and extreme value theory, finding substantial
differences among the approaches.

Given that, as we have mentioned already, the distribution of high frequency 
price increments is non-Gaussian, and even in many cases the conditional 
distribution of GARCH models is not Gaussian, many authors suggest us- 
ing bootstrap techniques to avoid particular assumptions on the distribution 
of factors beyond stationarity of the distribution of returns; see, for example, 
Barone-Adessi et al. (1999), Barone-Adessi and Giannopoulos (2001) and 
Vlaar (2000). Ruiz and Pascual (2002) review the use of bootstrap methods to 
estimate the VaR. 

Although there is a huge number of papers devoted to analyzing methods to
estimate the VaR as a measure of financial risk, this measure is not without
criticisms; see, for example, Szego (2002) and the papers contained in the
special issue of the Journal of Banking and Finance, 26. There are several
new measures of risk proposed as remedies for the deficiencies of VaR as, for
example, Conditional VaR (CVaR) and Expected Shortfall.

2.4.4 Estimation of diffusion processes 

There are two relatively independent lines in financial modeling: continuous-time
models typically used in theoretical finance and discrete-time models
favored for empirical work. The continuous-time models are dominated by
the diffusion approach. In contrast to the stochastic difference equations used in
discrete-time models, stochastic differential equations are widely used to describe
continuous-time models in the theoretical finance literature. The stochastic
processes characterized by the stochastic differential equations are Ito processes,
and a continuous-time model assumes that a security price $S_t$ follows
the stochastic differential equation:

$$dS_t = \mu_t S_t\,dt + \sigma_t S_t\,dW_t, \qquad t \in [0, T] \qquad (4.4)$$

where $W_t$ is a standard Wiener process, $\mu_t$ is called diffusion drift in probability
or instantaneous mean rate of return in finance and $\sigma_t^2$ is called diffusion
variance in probability or instantaneous conditional variance (or volatility).
The celebrated Black-Scholes model corresponds to (4.4) with constant
$\mu_t$ and $\sigma_t$. Given that financial time series tend to be highly heteroscedastic,
the general model assumes that $\sigma_t^2$ is random and is itself governed by
another stochastic differential equation.

For continuous-time models, the "no arbitrage" condition, as we have ex- 
tensively developed in section 2, can be characterized by a martingale measure, 
that is, a probability law under which St is a martingale. Prices of options and 
derivatives are then the conditional expectation of certain functionals of S under
this measure. The calculations and derivations can be carried out with tools
such as the Ito lemma and Girsanov theorem; see Karatzas and Shreve (1991) or the
overviews in Dixit (1993) and Merton (1990).

The log price process $X_t = \log(S_t)$, by the Ito lemma applied to (4.4),
follows the diffusion model

$$dX_t = (\mu_t - \sigma_t^2/2)\,dt + \sigma_t\,dW_t, \qquad (4.5)$$

where the drift for $X_t$ has a term in $\sigma_t^2$. GARCH models are used to represent
statistically the increments of the log price process, so from the diffusion point
of view, (4.5) is also a natural parametrization of the GARCH drift $\mu_t$.
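
For the constant-coefficient (Black-Scholes) case, the relation between the price equation (4.4) and the log-price equation can be checked by simulation. The sketch below drives an Euler discretization of (4.4) and an exact step for the log price, with Ito-corrected drift $\mu - \sigma^2/2$, using the same Brownian increments; the function name and parameter values are hypothetical.

```python
import math
import random


def simulate_terminal_prices(s0, mu, sigma, horizon, n_steps, seed=0):
    """Return (Euler price, exp of exact log price) at the end of the horizon.

    Both schemes use the same Wiener increments, so their difference
    reflects only the discretization error of the Euler step.
    """
    rng = random.Random(seed)
    dt = horizon / n_steps
    s_euler = s0
    x = math.log(s0)
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))
        # Euler step for dS = mu * S dt + sigma * S dW.
        s_euler += mu * s_euler * dt + sigma * s_euler * dw
        # Exact step for dX = (mu - sigma^2 / 2) dt + sigma dW.
        x += (mu - 0.5 * sigma ** 2) * dt + sigma * dw
    return s_euler, math.exp(x)
```

Dropping the $\sigma^2/2$ correction in the log-price drift would make the two terminal values drift apart as the horizon grows, which is an easy way to see why the Ito term matters.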

While the models are written in continuous-time, the available data are 
mostly sampled discretely in time. Ignoring this difference can result in inconsistent
estimators (see, e.g., Merton (1980)). A number of statistical/econometric
methods have been recently developed to estimate the parameters of a
continuous-time diffusion without requiring that a continuous record of observations
be available.

The method of moments together with simulation-based estimation has been
used by Gourieroux et al. (1993) and Gallant and Tauchen (1996). A forceful
criticism of simulation-based method-of-moments estimation has been that this
method does not provide a representation of the observables in terms of their
own past, as do maximum likelihood based on a conditional density and time
series methods such as ARIMA, ARCH and GARCH modeling; see Jacquier
et al. (1994). Gallant and Tauchen (1998) use the notion of reprojection to obtain
a representation of the observed process in terms of observables that incorporates
the dynamics implied by the possibly nonlinear system under consideration.
They propose a methodology for estimation and diagnostic assessment
of several diffusion models of the short rate expressed as a partially observed
system of stochastic differential equations. The theoretical support of the projection
method was provided by Gallant and Long (1997), who showed that it
achieves the same efficiency as ML.

Nonparametric density-matching methods have been applied in Ait-Sahalia 
(1996a, 1996b). Discretely observed diffusions have also been fit by estimating
functions; see Kessler and Sørensen (1999) and Kessler (2000). A Markov
chain Monte Carlo (MCMC) based method is proposed in Eraker (2001).
The method is applied to the estimation of parameters in one-factor interest- 
rate models and a two-factor model with a latent stochastic volatility compo- 
nent. 

Elerian et al. (2001) propose a new method for dealing with the estima- 
tion problem of stochastic differential equations that is likelihood based, can 
handle nonstationarity, and is not dependent on finding an appropriate auxil- 
iary model. As they point out, their idea is simply to treat the values of the 
diffusion between any two discrete measurements as missing data and then to 
apply tuned MCMC methods based on the Metropolis-Hastings algorithm to
learn about the missing data and the parameters. 

As in most contexts, provided one trusts the parametric specification in the
diffusion, ML is the method of choice. The major caveat in the present context
is that the likelihood function for discrete observations generated by the
parametric stochastic differential equation cannot be determined explicitly for most
models. Since the transition density is generally unknown, one is forced to
approximate it. The simulation-based approach suggested by Pedersen (1995)
has great theoretical appeal, but its implementation is computationally costly.
Durham and Gallant (2002) examine a variety of numerical techniques
designed to improve the performance of this approach.
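
To make the approximation step concrete, the following minimal sketch compares the crudest approximation of a transition density, the one-step Euler (Gaussian) scheme, with the exact density in a case where the latter happens to be known. The Ornstein-Uhlenbeck model dX = kappa (theta - X) dt + sigma dW, the function names and the parameter values are assumptions chosen purely for illustration; this is not the scheme of Pedersen (1995), which, roughly speaking, refines such one-step approximations by simulating over subintervals.

```python
import math

def ou_exact_logdensity(x_next, x, dt, kappa, theta, sigma):
    """Exact log transition density of dX = kappa*(theta - X) dt + sigma dW."""
    mean = theta + (x - theta) * math.exp(-kappa * dt)
    var = sigma ** 2 * (1.0 - math.exp(-2.0 * kappa * dt)) / (2.0 * kappa)
    return -0.5 * math.log(2.0 * math.pi * var) - (x_next - mean) ** 2 / (2.0 * var)

def euler_logdensity(x_next, x, dt, kappa, theta, sigma):
    """One-step Euler (Gaussian) approximation of the same transition density."""
    mean = x + kappa * (theta - x) * dt
    var = sigma ** 2 * dt
    return -0.5 * math.log(2.0 * math.pi * var) - (x_next - mean) ** 2 / (2.0 * var)

def log_likelihood(path, dt, params, logdensity):
    """Sum of log transition densities along a discretely observed path."""
    kappa, theta, sigma = params
    return sum(logdensity(b, a, dt, kappa, theta, sigma)
               for a, b in zip(path, path[1:]))

# Hypothetical short-rate observations on a fine grid (illustrative values only).
params = (0.5, 0.06, 0.02)          # kappa, theta, sigma
dt = 0.001
path = [0.05, 0.051, 0.0505, 0.052]
exact_ll = log_likelihood(path, dt, params, ou_exact_logdensity)
euler_ll = log_likelihood(path, dt, params, euler_logdensity)
```

For a small sampling interval the two log-likelihoods nearly coincide; the gap widens as `dt` grows, which is the situation in which more refined approximations are needed.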

If sampling of the process were continuous, the situation would be simpler.
First, the likelihood function for a continuous record can be obtained by means
of a classical absolutely continuous change of measure. Second, when the
sampling interval goes to zero, expansions of the transition function "in small time"
are available in the statistical literature, and some authors calculate expressions
for the transition function in terms of functionals of a Brownian bridge. Available
methods to compute the likelihood function in the case of discrete-time
sampling involve either solving numerically the Fokker-Planck-Kolmogorov partial
differential equation (see Lo (1988)) or simulating a large number of sample
paths along which the process is sampled very finely (see Pedersen (1995)).
Neither method produces a closed-form expression to be maximized over the
parameter: the criterion function takes the form either of an implicit solution
to a partial differential equation or of a sum over the
outcome of the simulations. Using Hermite polynomials, Ait-Sahalia (2002)
provides an explicit sequence of closed-form functions and shows that it
converges to the true (but unknown) likelihood function. It is also documented
that maximizing the sequence results in an estimator that converges to the true
ML estimator and shares its asymptotic properties.

As we have pointed out in section 3, high-frequency financial data are not
only discretely sampled in time, but the time separating successive
observations is often random. Ait-Sahalia and Mykland (2003) analyze the
consequences of this dual feature of the data when estimating a continuous-time
model. More precisely, they measure the additional effect of the randomness
of the sampling intervals over and beyond that due to the discreteness of the
data. They also examine the effect of simply ignoring the sampling
randomness and find that in many situations the randomness of the sampling has a larger
impact than the discreteness of the data.

As we have described previously, continuous-time models, dominated by
the diffusion approach, are typically favored in theoretical finance, while
discrete-time models, mainly of the ARCH type, are the focus of empirical
research. Nelson (1990) was the first to try to reconcile both approaches,
showing that GARCH processes weakly converge to some bivariate diffusions
as the length of the discrete time interval goes to zero. Later, Duan (1997)
proposed an augmented GARCH model and derived its diffusion limit. These
authors link the two types of models by weak convergence. Consequently, it
is rather common to apply the statistical inferences derived under the GARCH
model to its diffusion limit. However, Wang (2002), using Le
Cam's deficiency distance, shows that the GARCH model and its diffusion
limit are asymptotically equivalent only under deterministic volatility. He
concludes that, for modelling stochastic volatility, if a diffusion model is preferred,
it is statistically more efficient to fit data directly to the diffusion model and
carry out the inference.

2.5 Conclusions

Throughout the paper we have summarized several applications of
probabilistic and time series models in finance. We have especially focused on those
pricing models reflecting the absence of arbitrage and free lunch. Almost all
of them are characterized by the existence of equivalent martingale probability
measures (or risk-neutral measures). Thus the martingale property makes it possible to
price, hedge, speculate or compose efficient portfolios, since future prices must
verify the random walk assumption.

However, there are still many open problems that merit future research.
For instance, the absence of arbitrage (free lunch) does not always lead to martingales,
even if one focuses on perfect markets. When dealing with incomplete markets
there are infinitely many risk-neutral measures and it is necessary to establish
coherent criteria in order to choose the adequate one. For imperfect markets
we will never have a unique risk-neutral measure and it is also necessary to find
appropriate instruments in order to relate risk-neutral measures and hedging or
efficient strategies.

Most of the concrete pricing models applied in practice are characterized by
stochastic differential equations reflecting the dynamic behavior of the market. By
manipulating the stochastic equation it is possible to obtain the partial
differential equation or the risk-neutral measure leading to pricing or hedging rules,
as well as to the usual topics of asset pricing theory. Time series and
econometric models are the key when designing these pricing models and calibrating
or evaluating their empirical possibilities. Furthermore, the growing complexity
of real markets, characterized by more and more connections amongst them,
higher and higher volatilities, more and more complex risks and
securities, and an increasing number of investors, makes it necessary to improve
the models usually applied when dealing with pricing issues or interest-rate
linked topics.

In summary, probabilistic and time series approaches play a crucial role in
finance, all the more so if one focuses on arbitrage pricing theory.
Moreover, the level of development of current markets makes it essential to improve
and enlarge our knowledge of all the fields involved, from theoretical
foundations to empirical applications.

Abstract



The paper is a review of the problem from stochastic geometry stated in the
title. This problem concerns anisotropy quantification of fibre and surface
processes. The stereological equation connecting the rose of directions and the rose
of intersections (for a specific test system) was first attacked by means of
analytical methods. Later on, an analogue from convex geometry led to a deeper
investigation using the notion of a Steiner compact. Various estimators of the
rose of directions and their properties are reviewed in the planar and spatial
case. The methods are important in practice when quantifying real structures in
material science, biomedicine, etc.



Introduction 

In the model-based approach of stochastic geometry, objects are modelled
by means of random sets [Matheron, 1975]. The isotropy of a random set
can be defined by means of the invariance of its distribution with respect to
any rotation operator. The deviation from this property is called anisotropy.
Anisotropy is thus a rather broad notion. One can imagine the anisotropy of the
spatial distribution of objects, which may form chains of preferred orientation,
thus violating the isotropy assumption. This type of anisotropy is formalized
and studied e.g. in [Stoyan & Benes, 1991]. Special models of random sets
are fibre and surface processes where, besides anisotropy of spatial distribution,
a simpler type of anisotropy may be described by means of the distribution
of the tangent orientation of the fibres, or of the normal orientation of the
surfaces, at each point where it is defined. This probability distribution
$\mathcal{R}$ is called the rose of directions and will be of main interest in this paper.

In classical stereology, the information on geometrical objects is derived
from observations on lower dimensional probes (test systems). A well-known
stereological inverse problem (first formulated in [Hilliard, 1962]) relates the
rose of directions to the rose of intersections between the process and a test
system. In its simplest form it can be derived from the Buffon needle problem
formulated in geometrical probability in 1777. The rose of intersections
$P_L(u)$ is defined as the mean number of intersections between the process and
a unit test system of orientation $u$. Given observed intersection numbers, the
stereological relation is used to estimate the rose of directions. There
are several approaches to the solution of this problem. An analytical solution
of the integral equation leads to various difficulties. We review estimators of
the rose of directions separately in the planar and spatial case since the
background is qualitatively different. Probably the most promising approach is the one
that makes use of an analogy from convex geometry, which relates the support
function of a zonoid to its generating measure. Statistical properties of
the estimators such as consistency are reviewed, and methods
and models are compared by means of the simulated distribution of the Prohorov
distance between the estimated and true rose of directions. Various test systems
are investigated and illustrative examples are added.

3.1 An analytical approach 

Consider a stationary planar fibre process $\Phi$, which is a random element in
the measurable space $N$ of fibre systems (collections of smooth fibres), see
[Stoyan, Kendall & Mecke, 1995]. Let $P$ be the distribution of $\Phi$, $L_A$ the
intensity (mean fibre length per unit area) and $\mathcal{R}$ the rose of directions. A
realization $\varphi \in N$ of $\Phi$ is alternatively interpreted as a locally finite length
measure on $\mathbb{R}^2$, i.e. $\varphi(B)$ is the length of fibres from $\varphi$ in a Borel set $B$.
Denote by $w(x)$ the tangent orientation at a fibre point $x$. Axial orientations from
$\Pi = [0, \pi)$ are considered.

3.1.1 A general stereological relation 

First a more general stereological relation in $\mathbb{R}^2$ is derived, cf. [Mecke &
Stoyan, 1980]. Let $\nu_d$ denote the $d$-dimensional Lebesgue measure. From the
Campbell theorem [Stoyan, Kendall & Mecke, 1995] it follows immediately
for an arbitrary non-negative measurable function $f$ on $\mathbb{R}^2 \times \Pi$ that

$$\int\!\!\int f(x, w(x))\,\varphi(dx)\,P(d\varphi) = L_A \int\!\!\int f(x, \alpha)\,\mathcal{R}(d\alpha)\,\nu_2(dx). \tag{1.1}$$

LEMMA 1 Let $g : \mathbb{R}^2 \to \mathbb{R}$ be a non-negative measurable function, $\varphi \in N$.
Then

$$\int \sum_{x_1 : (x_1, x_2) \in \varphi} g(x_1, x_2)\,\nu_1(dx_2) = \int g(x) \sin(w(x))\,\varphi(dx), \tag{1.2}$$

where $x = (x_1, x_2) \in \mathbb{R}^2$.

Proof: A simple argument based on the total projection is used. For a Borel set
$B \subset \mathbb{R}^2$ and $g = 1_B$ it holds

$$\int \sum_{x_1 : (x_1, x_2) \in \varphi} 1_B(x_1, x_2)\,\nu_1(dx_2) = \int 1_B(x) \sin(w(x))\,\varphi(dx),$$

since both sides correspond to the length of the total projection of $\varphi \cap B$ onto the
$x_2$-axis. Using the standard measure-theoretic argument, formula (1.2) is
obtained. □

THEOREM 2 Let $f : \mathbb{R} \times \Pi \to \mathbb{R}$ be a measurable non-negative function, $\Phi$
a stationary fibre process in $\mathbb{R}^2$. For the intersection of $\Phi$ with the $x_1$-axis it holds

$$E \sum_{x_1 : (x_1, 0) \in \Phi} f(x_1, w(x_1, 0)) = L_A \int\!\!\int f(x_1, \alpha) \sin\alpha\,\mathcal{R}(d\alpha)\,\nu_1(dx_1). \tag{1.3}$$

Proof: Let $g : \mathbb{R} \to \mathbb{R}_+$ be a measurable function such that $\int g(t)\,\nu_1(dt) = 1$.
Using (1.1), (1.2) and stationarity, it holds

$$\begin{aligned}
L_A \int\!\!\int g(x_2) f(x_1, \alpha) \sin\alpha\,\mathcal{R}(d\alpha)\,\nu_2(dx)
&= E \int g(x_2) f(x_1, w(x)) \sin w(x)\,\Phi(dx) \\
&= E \int \sum_{x_1 : (x_1, x_2) \in \Phi} g(x_2) f(x_1, w(x_1, x_2))\,\nu_1(dx_2) \\
&= E \int g(x_2)\,\nu_1(dx_2) \sum_{x_1 : (x_1, 0) \in \Phi} f(x_1, w(x_1, 0)).
\end{aligned}$$

□

The intersection of $\Phi$ with the $x_1$-axis forms a stationary point process $\Psi$;
denote its intensity by $P_L$. Using special forms of $f$ in Theorem 2, relations are
obtained between the fibre process and the induced structure on the test line
(here the $x_1$-axis).

COROLLARY 3 In the situation of Theorem 2 let $\vartheta$ be the distribution of the
fibre tangent orientation at the point of intersection with the $x_1$-axis. Then it holds
for $\beta \in \Pi$

$$P_L\,\vartheta([0, \beta)) = L_A \int_0^\beta \sin\alpha\,\mathcal{R}(d\alpha), \tag{1.4}$$

thus

$$\vartheta([0, \beta)) = \frac{\int_0^\beta \sin\alpha\,\mathcal{R}(d\alpha)}{\int_0^\pi \sin\alpha\,\mathcal{R}(d\alpha)}, \tag{1.5}$$

if $\mathcal{R}(\{0\}) < 1$.

Proof: Putting $f(x_1, \alpha) = 1_{[0,1]}(x_1)\,1_{[0,\beta)}(\alpha)$ in (1.3) one obtains (1.4), and
using this with $\beta = \pi$ finally (1.5) is concluded. □

EXAMPLE 4 If $\mathcal{R}(\{0\}) = 0$, one can get $L_A$ and $\mathcal{R}$ from $P_L$ and $\vartheta$ (the latter
pair of quantities can be estimated from the observation in the neighbourhood
of a linear section). From (1.4) it holds

$$\mathcal{R}([0, \beta)) = \frac{P_L}{L_A} \int_0^\beta (\sin\alpha)^{-1}\,\vartheta(d\alpha)$$

and for $\beta = \pi$ specially

$$L_A = P_L \int_0^\pi (\sin\alpha)^{-1}\,\vartheta(d\alpha).$$

A simpler choice of $f(x_1, \alpha) = 1_{[0,1]}(x_1)$ in (1.3) leads to the well-known
formula

$$P_L = L_A \int_0^\pi \sin\alpha\,\mathcal{R}(d\alpha), \tag{1.6}$$

which corresponds to the frequent case that the information on intersection
angles is not available. This case is in fact the main object of our paper.
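
The two directions of Example 4 can be sketched numerically for a rose of directions with finitely many atoms. The atomic rose, the function names and the numerical values below are assumptions made for illustration; the code merely evaluates (1.4) and (1.6) for atoms $\alpha_i \in (0, \pi)$ and then inverts them as in Example 4.

```python
import math

def section_from_rose(L_A, angles, weights):
    """Forward step: P_L by (1.6) and the distribution theta of intersection
    angles by (1.4), for a rose with atoms `angles` and masses `weights`."""
    s = [w * math.sin(a) for a, w in zip(angles, weights)]
    total = sum(s)
    P_L = L_A * total
    theta = [v / total for v in s]
    return P_L, theta

def rose_from_section(P_L, angles, theta):
    """Inverse step of Example 4: recover L_A and the rose from P_L and theta.
    Requires sin(angle) > 0 for every atom, i.e. no mass at orientation 0."""
    inv = [t / math.sin(a) for a, t in zip(angles, theta)]
    L_A = P_L * sum(inv)
    rose = [P_L * v / L_A for v in inv]
    return L_A, rose

# Hypothetical rose with three atoms.
angles = [math.pi / 6, math.pi / 2, 2 * math.pi / 3]
weights = [0.2, 0.5, 0.3]
P_L, theta = section_from_rose(3.0, angles, weights)
L_A_back, rose_back = rose_from_section(P_L, angles, theta)
```

The round trip returns the original intensity and weights, which illustrates that nothing is lost when the intersection angles are observed, in contrast to the count-only case (1.6).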

3.1.2 Relation between roses of directions and intersections

Let $\Phi$ be a stationary fibre process in $\mathbb{R}^2$ as in the previous paragraph. Let
$P_L(\beta)$, $\beta \in \Pi$, be the rose of intersections, i.e. the mean number of points of
$\Phi \cap l(\beta)$ per unit length of a test straight line $l(\beta)$ with orientation $\beta$. The
basic integral equation relating the rose of directions of $\Phi$ to its rose of
intersections is obtained by a simple generalization of (1.6). Consider $\Pi$ with
addition modulo $\pi$. The addition may be interpreted as a rotation of straight
lines around the origin in the plane $\mathbb{R}^2$.

It holds from (1.6)

$$P_L(\beta) = L_A\,\mathcal{G}_{\mathcal{R}}(\beta), \tag{1.7}$$

where we denote the sine transform

$$\mathcal{G}_{\mathcal{R}}(\beta) = \int_0^\pi |\sin(\beta - \alpha)|\,\mathcal{R}(d\alpha). \tag{1.8}$$

In the following text an equivalent expression of formula (1.7) is used. By
$S^{d-1}$ the unit sphere in $\mathbb{R}^d$ is denoted. Characterize a test line in $\mathbb{R}^2$ by its pair
of unit normal vectors $\pm u \in S^1$ and define $P_L(-u) = P_L(u)$. Denote by $\langle \cdot, \cdot \rangle$ the
scalar product. Then it holds

$$P_L(u) = L_A\,\mathcal{F}_{\mathcal{R}}(u), \tag{1.9}$$

where the cosine transform

$$\mathcal{F}_{\mathcal{R}}(u) = \int_{S^1} |\langle u, v \rangle|\,\mathcal{R}(dv). \tag{1.10}$$

Note that here $\mathcal{R}$ represents a centrally symmetric probability measure on $S^1$.
Further, by $\mathcal{M}_d$, $\mathcal{P}_d$ the space of finite measures, probability measures on $S^d$,
respectively, is denoted. If there is no danger of confusion we write $\mathcal{M}_d = \mathcal{M}$,
$\mathcal{P}_d = \mathcal{P}$.

Let the test system for a fibre process in $\mathbb{R}^3$ be a plane or its subset
characterized by a unit normal $u \in S^2$. Denoting by $L_V$ the length intensity of a
stationary fibre process $\Phi$ in $\mathbb{R}^3$ and by $P_A(u)$ the intensity of the point process
induced by $\Phi$ in the test plane, we have

$$P_A(u) = L_V \int_{S^2} |\langle u, v \rangle|\,\mathcal{R}(dv). \tag{1.11}$$

By symmetry, a stationary surface process [Stoyan, Kendall & Mecke, 1995]
of intensity $S_V$ (mean surface area per unit volume) with a local normal $v \in S^2$
having an orientation distribution $\mathcal{R}$ induces on a test line of direction $u \in S^2$
a point process with intensity $P_L(u)$, and similarly

$$P_L(u) = S_V \int_{S^2} |\langle u, v \rangle|\,\mathcal{R}(dv). \tag{1.12}$$

The generalization to $\mathbb{R}^d$ for stationary fibre and hypersurface processes with
intensity $\lambda$ is straightforward; the form of the integral equations (1.11), (1.12)
remains intact and only the integration region $S^2$ is replaced by $S^{d-1}$.

Denote by $\mathcal{U}$ the uniform probability measure on $S^{d-1}$. Note that for unknown
$\mathcal{R}$ it is possible to estimate $\lambda = \bar{P}_L / O^{d-1}$, where $\bar{P}_L = \int P_L(u)\,\mathcal{U}(du)$ can
be approximated by an average of observations $P_L(u_j)$ systematically spread
on $S^{d-1}$ and $O^{d-1} = \int \mathcal{F}_{\mathcal{R}}(u)\,\mathcal{U}(du)$ is a known constant ($2/\pi$ in the plane,
$1/2$ in space). Therefore in the following the problem of estimating $\mathcal{R}$ can be
considered equivalent to the problem of estimating $\lambda\mathcal{R}$.
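
The intensity estimate described above can be sketched for the planar case as follows. The sketch assumes that the planar constant is $2/\pi$, the mean of $|\cos t|$ over a half-turn; the function names, the number of test directions and the single-direction fibre system used as a check are hypothetical.

```python
import math

def mean_abs_cos_planar(m=100000):
    """Midpoint-rule value of the mean of |cos t| over [0, pi)."""
    h = math.pi / m
    return sum(abs(math.cos((i + 0.5) * h)) for i in range(m)) * h / math.pi

def estimate_intensity(intersection_rates):
    """Estimate L_A as the average of P_L(u_j) over systematically spread
    test directions, divided by the planar constant 2/pi."""
    mean_rate = sum(intersection_rates) / len(intersection_rates)
    return mean_rate / (2.0 / math.pi)

# Hypothetical check: every fibre has direction v, so by (1.9)-(1.10)
# P_L(u) = L_A * |cos(u - v)| for a test line with normal angle u.
L_A, v, n = 5.0, 0.3, 36
normals = [j * math.pi / n for j in range(n)]
rates = [L_A * abs(math.cos(u - v)) for u in normals]
L_A_hat = estimate_intensity(rates)
```

Even for this extremely anisotropic system the averaged counts recover the intensity closely without any knowledge of the rose of directions, which is why the estimation of the rose and of the intensity can be separated.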
3.1.3 Estimation of the rose of directions

Several methods based on formula (1.9) have been suggested for the estimation
of the rose of directions of a planar fibre process, cf. [Hilliard, 1962],
[Digabel, 1976], [Mecke, 1981], [Kanatani, 1984], [Rataj & Saxl, 1989], [Benes
& Gokhale, 2000]. The aim is to estimate $\mathcal{R}$ given estimators $\eta_j = \eta(u_j)$
of $P_L(u_j)$, $u_j \in S^1$, $j = 1, \dots, n$, where $\eta_j$ is the observed number of
intersections per unit test probe of orientation $u_j$. This was done basically in three
ways.

First, if a continuous probability density $p$ of $\mathcal{R}$ exists we have

$$P_L''(u) + P_L(u) = 2 L_A\,p(u), \tag{1.13}$$

which yields an explicit solution. This is hardly tractable in practice since the
second derivative $P_L''$ has to be evaluated from discrete data. However, the
formula is useful when a parametric model for $\mathcal{R}$ is available, cf. [Digabel,
1976].

Another natural approach to the solution of (1.7) is Fourier analysis.
Hilliard [Hilliard, 1962] showed that for the Fourier images

$$\hat{\mathcal{R}}(k) = \int_0^\pi e^{2iku}\,\mathcal{R}(du), \quad k = \dots, -1, 0, 1, \dots, \tag{1.14}$$

and $\hat{P}_L(k) = \int_0^\pi P_L(v)\,e^{2ikv}\,dv$, it holds

$$\hat{\mathcal{R}}(k) = \frac{1}{2 L_A}\,(1 - 4k^2)\,\hat{P}_L(k), \quad k = \dots, -1, 0, 1, \dots \tag{1.15}$$

When getting $\hat{P}_L(k)$ from the data and using (1.15), the variances of $\hat{\mathcal{R}}(k)$ may
tend to infinity.
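
A short calculation indicates where this instability comes from. Under the simplifying assumption (not made in the text) that $L_A$ is known, so that the only randomness is in the estimated coefficients $\hat{P}_L(k)$, relation (1.15) gives

$$\operatorname{Var}\hat{\mathcal{R}}(k) = \frac{(1 - 4k^2)^2}{4 L_A^2}\,\operatorname{Var}\hat{P}_L(k).$$

Hence, if the variance of the estimated coefficients $\hat{P}_L(k)$ does not decay as $|k|$ grows, the variance of $\hat{\mathcal{R}}(k)$ grows roughly like $k^4$, and any truncated Fourier series for the rose of directions is dominated by noise in its higher terms.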

The third approach is based on the convex geometry and will be described 
in a separate section. 

EXAMPLE 5 Consider the fibre system in Fig. 1 with four test lines of equal
length 1 and orientations $u_i = i\pi/4$, $i = 0, 1, 2, 3$, respectively. The
intersection counts are $\eta_i = 6, 3, 7, 7$, $i = 0, 1, 2, 3$. First a parametric approach
is used for the estimation of the rose of directions. Using a cardioidal model
[Rataj & Saxl, 1992] for $p$:

$$p(v) = \frac{1}{\pi}\,\bigl(1 - k\cos 2(v - v_0)\bigr), \tag{1.16}$$

we obtain from (1.13)

$$P_L(u) = \frac{2 L_A}{\pi}\,\Bigl(1 - \frac{k}{3}\cos 2(u - v_0)\Bigr). \tag{1.17}$$

Using the least squares method a fitted curve is obtained for $P_L(u)$, see Fig.
2, and the estimated rose of directions is shown in Fig. 3. Since the parameter $k$ was
estimated by the value $k = 1.2$, which is greater than 1, the model density of
the rose of directions also yields negative values, which are shown in Fig. 3.
The presence of negative values is a common problem
of analytical estimators (also those based on Fourier expansions).




Figure 1. A fibre system intersected by a system of test lines of unit length.

Figure 2. Polar plot of intersection counts $\eta_i$ from Fig. 1 and the rose of intersections fitted
by means of the cardioidal model.

Figure 3. Polar plot of the rose of directions estimated from the data in Fig. 2 using the cardioidal
model. Two small loops have negative values of radii.



Consider further the three-dimensional situation. Because of the identical
structure of the integral equations (1.11), (1.12) for fibre and surface processes in $\mathbb{R}^3$,
we restrict ourselves to the case of a stationary fibre process $\Phi$. The problem
is again to estimate the rose of directions $\mathcal{R}$ given a sample of test directions
$u_1, \dots, u_k \in S^2$ and estimators $\eta_i = n_i / A$, where $n_i$ is the number of
intersections between $\Phi$ and a planar test probe with area $A$ and normal
orientation $u_i$. Similarly to the planar case, and leaving aside the procedure based
on convex geometry, there are basically two other approaches to the solution.

First, a parametric approach means that a parametric type of distribution
on the sphere is suggested and the parameters are estimated from the data using
(1.11). In [Cruz-Orive et al., 1985] the axial Dimroth-Watson distribution was
used,

$$\mathcal{R}(du) = \mathrm{const} \cdot \exp(\kappa \cos^2 \vartheta)\,du,$$

where $u = (\vartheta, \varphi)$ in spherical coordinates, $\vartheta \in [0, \pi/2]$ being the colatitude
and $\varphi \in [0, 2\pi)$ the longitude. The parameter $\kappa \in \mathbb{R}^1$ is estimated.

Secondly, an inversion formula for (1.11) is available ([Hilliard, 1962], [Mecke
& Nagel, 1980]) using spherical harmonics. It is based on the fact that
spherical harmonics are eigenfunctions of the cosine transform (1.10). The method
in [Kanatani, 1984] approximates $\eta_i$ by a finite series of even spherical
harmonics and the inverse is then evaluated directly. An explicit inverse formula
from [Mecke & Nagel, 1980] says

$$p(v) = \frac{1}{L_V} \sum_{n=0}^{\infty} \frac{4n + 1}{c_n} \int_{S^2} P_A(u)\,Q_{2n}(\langle u, v \rangle)\,\mathcal{U}(du), \tag{1.18}$$

where $Q_m$ is the Legendre polynomial of order $m$ and $p$ is the probability density of
$\mathcal{R}$ (with respect to $\mathcal{U}$). The constants $c_n$ are

$$c_n = \frac{(-1)^{n+1}}{4n^2 - 1} \cdot \frac{1 \cdot 3 \cdot 5 \cdots (2n - 1)}{2 \cdot 4 \cdot 6 \cdots (2n - 2)}, \quad n = 0, 1, \dots$$

To conclude, analytical solutions of the inverse problem (1.9) in both two
and three dimensions may lead to estimators of the rose of directions which are
not non-negative densities. Typically these methods are not useful for sharp or
multimodal anisotropies.

3.2 Convex geometry approach 

In this section some notions from convex geometry are first recalled (see
e.g. [Schneider, 1993]). Let $\mathcal{K}$, $\mathcal{K}'$ be the system of all compact convex sets,
nonempty compact convex sets in $\mathbb{R}^d$, respectively. If $K \in \mathcal{K}'$ then for each
$u \in S^{d-1}$ there is exactly one number $h(K, u)$ such that the hyperplane (line
in $\mathbb{R}^2$, plane in $\mathbb{R}^3$)

$$\{x \in \mathbb{R}^d : \langle x, u \rangle - h(K, u) = 0\} \tag{2.1}$$

intersects $K$ and $\langle x, u \rangle - h(K, u) \le 0$ for each $x \in K$. This hyperplane is
called the support hyperplane and the function $h(K, u)$, $u \in S^{d-1}$, is the
support function (restricted to $S^{d-1}$) of $K$. Equivalently, $h(K, u) = \sup\{\langle x, u \rangle : x \in K\}$.
Its geometrical meaning is the signed distance of the support hyperplane
from the origin of coordinates; $h(K, u) + h(K, -u) = w(K, u)$ is the width of
$K$, i.e. the distance between the parallel support hyperplanes. The important property
of $h(K, u)$ is its additivity in the first argument: $h(K_1 + K_2, u) = h(K_1, u) +
h(K_2, u)$ (the addition of sets on the left hand side is in the Minkowski sense).
Convex bodies with a centre of symmetry will be considered mostly in what
follows. They will be called centred for short if this centre is in the origin of $\mathbb{R}^d$.

A Minkowski sum of finitely many line segments is called a zonotope. Be- 
sides its being centrally symmetric, also its two-dimensional faces are centrally 
symmetric. Consequently, regular octahedron, icosahedron and pentagonal do- 
decahedron are not zonotopes. On the other hand in R 2 , all centrally symmetric 
polygons are zonotopes. 

Consider a centred zonotope

$$Z = \sum_{i=1}^{k} a_i\,[v_i, -v_i], \tag{2.2}$$

where $a_i > 0$, $v_i \in S^{d-1}$. Its support function is given by

$$h(Z, u) = \sum_{i=1}^{k} a_i\,|\langle u, v_i \rangle| \tag{2.3}$$

and, conversely, a body $Z \in \mathcal{K}'$ with the support function (2.3) is a zonotope
with the centre in the origin.
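
Formula (2.3) is straightforward to evaluate. The following minimal planar sketch represents a centred zonotope by a list of pairs (a_i, angle of v_i); this representation and the function name are assumptions made for illustration.

```python
import math

def zonotope_support(u_angle, segments):
    """Support function (2.3) of the planar centred zonotope
    Z = sum_i a_i [v_i, -v_i], evaluated at the unit vector with angle u_angle.
    `segments` is a list of (a_i, v_angle_i) pairs."""
    ux, uy = math.cos(u_angle), math.sin(u_angle)
    total = 0.0
    for a, v_angle in segments:
        vx, vy = math.cos(v_angle), math.sin(v_angle)
        total += a * abs(ux * vx + uy * vy)
    return total

# The square [-1, 1]^2 is the Minkowski sum of two unit-half-length segments
# along the coordinate axes.
square = [(1.0, 0.0), (1.0, math.pi / 2)]
h_axis = zonotope_support(0.0, square)              # distance to the side x = 1
h_diagonal = zonotope_support(math.pi / 4, square)  # distance to the vertex (1, 1)

# Additivity in the first argument: concatenating segment lists adds supports.
extra = [(0.5, math.pi / 3)]
h_sum = zonotope_support(0.7, square + extra)
```

The additivity of the support function under Minkowski addition shows up here as nothing more than concatenation of the segment lists.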

Consider the Hausdorff metric on $\mathcal{K}'$

$$H(K, L) = \max\Bigl(\sup_{x \in L} d(K, x),\ \sup_{y \in K} d(y, L)\Bigr);$$

the corresponding convergence is denoted as H-convergence. A set $Z \in \mathcal{K}'$
is called a zonoid if it is an H-limit of a sequence of zonotopes.

$Z \in \mathcal{K}'$ is a centred zonoid if and only if its support function has a
representation

$$h(Z, u) = \int_{S^{d-1}} |\langle u, v \rangle|\,\mu(dv) \tag{2.4}$$

for an even measure $\mu$ on $S^{d-1}$. $\mu$ is called the generating measure of $Z$ and it
is unique as shown in [Goodey & Weil, 1993]. For the zonotope (2.2) we have
the generating measure

$$\mu = \sum_{i=1}^{k} a_i\,\varepsilon_{v_i}, \tag{2.5}$$

where $\varepsilon_{v_i} = \frac{1}{2}(\delta_{v_i} + \delta_{-v_i})$ and $\delta_u$ is the Dirac measure concentrated at $u$.

Zonotopes and zonoids have several interesting properties and wide
applications (see [Goodey & Weil, 1993], [Schneider & Weil, 1983]); e.g. the
polytopes filling (tiling) $\mathbb{R}^3$ by translations are necessarily zonotopes (cubes,
rhombic dodecahedrons, tetrakaidecahedrons). The roses of intersections $P_L(u)$,
$P_A(u)$ are proportional to $\int_{S^2} |\langle u, v \rangle|\,\mathcal{R}(dv)$, cf. (1.11), (1.12). Consequently,
they can be considered as support functions of certain zonoids, the generating
measures of which are proportional to the corresponding roses of directions.
This idea was first put forward by Matheron [Matheron, 1975] and the
corresponding zonoid $Z$ associated to $\mathcal{R}$ was called the Steiner compact.
Because of the uniqueness of the generating measure of zonoids, the association
is unique. The problem is, as before, to estimate (in atomic form) the
generating measure $\mu$ or its normalized version $\mathcal{R}$ (the rose of directions) from the $\eta_i$, assumed
to be the support function values $h(Z_n, u_i)$ of a zonotope $Z_n$ estimating $Z$ in
(2.4). The following theorem can serve as a basis of the procedure.

THEOREM 6 For a zonoid $Z \subset \mathbb{R}^d$ and unit vectors $u_1, \dots, u_k$ there always
exists a zonotope $Z_k$ which is the sum of at most $k$ segments and fulfills

$$h(Z, \pm u_1) = h(Z_k, \pm u_1),\ \dots,\ h(Z, \pm u_k) = h(Z_k, \pm u_k). \tag{2.6}$$

If a zonotope $Z_k$ satisfying (2.6) is found, its generating measure of the type
(2.5) yields, after normalizing to a probability measure, the desired estimator
$\mathcal{R}_k$ of the rose of directions $\mathcal{R}$. Generating measures belong to the space $\mathcal{M}$.
The H-convergence on $\mathcal{K}'$ is equivalent to the weak convergence on $\mathcal{M}$ with
respect to the transformation (2.4). Since the weak convergence on $\mathcal{M}$ is
metrized by the Prohorov metric, it is possible to describe theoretically the
quality of the estimator by means of the Prohorov distance between $\mathcal{R}_k$ and $\mathcal{R}$.
The Prohorov distance between measures $Q, T \in \mathcal{M}$ is defined as

$$r(Q, T) = \inf\{\varepsilon > 0 :\ Q(C) \le T(C^\varepsilon) + \varepsilon,\ T(C) \le Q(C^\varepsilon) + \varepsilon \ \text{for all closed } C \subset S^{d-1}\},$$

where $C^\varepsilon$ denotes the $\varepsilon$-neighbourhood of $C$. For probability measures, and
therefore also in our situation, this definition is
equivalent to a restricted condition, which is used in the form

$$r(\mathcal{R}_k, \mathcal{R}) = \inf\{\varepsilon > 0 :\ \mathcal{R}_k(C) \le \mathcal{R}(C^\varepsilon) + \varepsilon \ \text{for all closed } C \subset S^{d-1}\}. \tag{2.7}$$

Because of (2.5) the estimator $\mathcal{R}_k$ is discrete with finite support $\operatorname{supp} \mathcal{R}_k \subset
\{z_1, \dots, z_k\}$, so there is the following reduction to finitely many conditions, cf.
[Benes & Gokhale, 2000]. It holds

$$r(\mathcal{R}_k, \mathcal{R}) = \inf\{\varepsilon > 0 :\ \mathcal{R}_k(C) \le \mathcal{R}(C^\varepsilon) + \varepsilon \ \text{for all } C \subset \operatorname{supp} \mathcal{R}_k\}. \tag{2.8}$$

This makes it possible to compute the Prohorov distance, which will be used in the
following for a comparison of estimators.

The construction of a zonotope or of a sequence of zonotopes $Z_k$ such that
$H(Z_k, Z) \to 0$ when $k \to \infty$ is simple only in $\mathbb{R}^2$. It is sufficient to set

$$Z_k = \bigcap_{i=1}^{k} \{x \in \mathbb{R}^2 : \langle x, u_i \rangle \le h_i\}, \tag{2.9}$$

since every centred polygon is a zonotope in $\mathbb{R}^2$. In $\mathbb{R}^3$ this is not the case,
thus the situation is more complicated, and an optimization procedure based
on the constructive proof of Theorem 6 in [Campi, Haas & Weil, 1994] is a
partial solution. More recently, the paper [Kiderlen, 2001] made a substantial step
forward on this problem.

Consequently, the estimation of $\mathcal{R}$ by means of the Steiner compact will be
treated separately for the planar and spatial cases as follows.

3.2.1 Steiner compact in $\mathbb{R}^2$

The relation between a measure $\mu \in \mathcal{M}$ and the zonoid $Z$ generated by it
has a direct consequence of geometrical nature. Let $T_Z(u)$ be the intersection
point of the support line (corresponding to $u$) with $Z$ (if the intersection is a
line segment, $T_Z(u)$ will be the endpoint with respect to the anti-clockwise
orientation of the boundary $\partial Z$ of $Z$). If $x$, $y$ are two points of $\partial Z$, by $l_Z(x, y)$
the length of the corresponding arc of $\partial Z$ is denoted. The following result
comes from [Rataj & Saxl, 1989] and it was obtained in [Matheron, 1975] in a
more general setting.

THEOREM 7 There is a one-to-one correspondence between symmetric
elements $\mu \in \mathcal{M}$ and centrally symmetric $Z \in \mathcal{K}'$ given by

$$\mu((s, t]) = l_Z(T_Z(s), T_Z(t)), \quad s, t \in S^1.$$

Consequently, the length (per unit area) of fibres with tangents within an
interval of directions $(v_1, v_2]$ is proportional to the length of the boundary $\partial Z$
bounded by the pair of equally oriented tangents.

For a stationary fibre process $\Phi$ and the zonoid (Steiner compact) $Z$
associated to the rose of directions $\mathcal{R}$ of $\Phi$ it holds

$$h(Z, u) = \frac{1}{2}\,L_A\,\mathcal{F}_{\mathcal{R}}(u), \quad u \in S^1, \tag{2.10}$$

i.e. comparing with (1.7), $2h(Z, u) = P_L(u)$, $u \in S^1$.

[Rataj & Saxl, 1989] suggested a graphical method of estimation of the rose
of directions by means of its related Steiner compact set. Let

$$p_i = \frac{1}{2}\,\eta_i = \frac{n_i}{2l} \tag{2.11}$$

be the estimators of the support function values at (axial) orientations
$u_i \in S^1$, $i = 1, \dots, k$, where $n_i$ is the number of intersections of the
studied fibre system (realization of a fibre process) with a test segment of length
$l$ and orientation $u_i$. Then by (2.9), the convex polygon ($2k$-gon, $p_{i+k} = p_i$,
$i = 1, \dots, k$)

$$Z_k = \{x : \langle x, u_i \rangle \le p_i,\ i = 1, \dots, 2k\} \tag{2.12}$$

provides a basis for the estimation of the Steiner compact $Z$ related to $\mathcal{R}$. The
measure $\mu_k$ corresponding to $Z_k$ according to Theorem 7 is a sum over
$i = 1, \dots, k$ of atoms weighted by $h_i$ (2.13), where $h_i$ are the lengths of the
edges of the polygon $Z_k$. The edges of length $h_i$ have outer normals $u_i$; in fact
$Z_k$ may have fewer edges than $2k$ if $h_i = 0$ for some $i$. The
relation between $p_i$ and $h_i$ follows (cf. [Benes & Gokhale, 2000]; we denote
$a_+ = \max(a, 0)$):

$$h_i = \Bigl(\min_{-\pi < \beta_{ij} < 0} \frac{p_i \cos\beta_{ij} - p_j}{\sin\beta_{ij}} - \max_{0 < \beta_{ij} < \pi} \frac{p_i \cos\beta_{ij} - p_j}{\sin\beta_{ij}}\Bigr)_+, \quad i = 1, \dots, k, \tag{2.14}$$

where $\beta_{ij}$ are the anticlockwise oriented angles between $u_i$ and $u_j$. Finally, after
normalization of the $h_i$ (2.15), we obtain the desired estimator $\mathcal{R}_k$ of the rose of
directions $\mathcal{R}$ as the corresponding sum over $i = 1, \dots, k$ (2.16).

The H-convergence of $Z_k$ is investigated by [Rataj & Saxl, 1989].
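
The edge-length relation (2.14) can be sketched as follows. The function names and the two small test polygons are assumptions made for illustration; the normalization step simply rescales the edge lengths to unit total, which is how "after normalization" is read here. At least two distinct orientations are required.

```python
import math

def steiner_edge_lengths(normal_angles, p):
    """Edge lengths h_i, i = 1..k, of the 2k-gon (2.12) computed by (2.14).
    normal_angles: angles of the axial normals u_i in [0, pi); p: values p_i."""
    k = len(normal_angles)
    # Extend to the full 2k-gon: u_{i+k} = -u_i and p_{i+k} = p_i.
    ang = list(normal_angles) + [a + math.pi for a in normal_angles]
    pp = list(p) + list(p)
    h = []
    for i in range(k):
        q_min_neg, q_max_pos = math.inf, -math.inf
        for j in range(2 * k):
            if j == i:
                continue
            # Anticlockwise angle from u_i to u_j, reduced to [-pi, pi).
            beta = (ang[j] - ang[i] + math.pi) % (2 * math.pi) - math.pi
            s = math.sin(beta)
            if abs(s) < 1e-12:
                continue  # the opposite parallel side does not cut edge i
            q = (pp[i] * math.cos(beta) - pp[j]) / s
            if beta < 0:
                q_min_neg = min(q_min_neg, q)
            else:
                q_max_pos = max(q_max_pos, q)
        h.append(max(q_min_neg - q_max_pos, 0.0))
    return h

def normalized_weights(h):
    """Edge lengths rescaled to sum to one."""
    total = sum(h)
    return [v / total for v in h]

# Square [-1, 1]^2: two normals, both support values equal to 1.
h_square = steiner_edge_lengths([0.0, math.pi / 2], [1.0, 1.0])
# Adding a redundant half-plane (support value 2 > sqrt(2) at angle pi/4)
# must produce an edge of length zero.
h_redundant = steiner_edge_lengths([0.0, math.pi / 4, math.pi / 2], [1.0, 2.0, 1.0])
w_redundant = normalized_weights(h_redundant)
```

The second case shows the role of the positive part in (2.14): a test direction whose half-plane does not touch the polygon contributes an atom of zero weight.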



EXAMPLE 8 We continue Example 5. This time the data from Fig. 1 are
evaluated by means of the Steiner compact method. Using formula (2.12) the
zonotope in Fig. 4 (left) is constructed (recall that the test lines are
characterized by their unit normal vectors) and from (2.14) the estimator (2.13) is obtained
and plotted in Fig. 4 (right). The dominant direction is recognized; however,
the second largest atom is unrealistic, a consequence of the sparse test
system.

Figure 4. A Steiner compact $Z_k$ (left) and the estimated rose of directions (right) for data
from Example 5. On the right a circular plot is used where the $h_i$ in (2.16) correspond to the radii
of classes.



[Rataj & Saxl, 1989] developed a modification of Steiner compact
estimators of $\mathcal{R}$ by means of the following smoothing. For integer $n$ and orientations
$0 \le u_1 < u_2 < \dots < u_n < \pi$, for integer $r$ and weights

$$\{c_j : j = -r, \dots, 0, \dots, r\}, \quad c_{-j} = c_j > 0,\ j = 0, \dots, r, \quad \sum_{j=-r}^{r} c_j = 1, \tag{2.17}$$

they construct polygons

$$K_n = \{x : \langle x, u_i \rangle \le \tilde{p}_i,\ i = 1, \dots, n\}, \quad \text{where } \tilde{p}_i = \sum_{j=-r}^{r} c_j\,p_{i+j},\ i = 1, \dots, n, \tag{2.18}$$

and $p_i$ are as in (2.11). Let $h_i$ be the lengths of edges of $K_n$ and $\bar{h}_i$ as in (2.15).
Then the estimator of $\mathcal{R}$ is $\mathcal{R}_n(B) = \sum_{i=1}^{n} \bar{h}_i\,1_B(u_i)$ for a Borel set $B \subset S^1$,
cf. (2.16).
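
The smoothing of the support values in (2.18) amounts to a circular weighted moving average. In the sketch below the indices are taken modulo n (relying on the axial convention that orientations repeat after a half-turn); the weights (1/4, 1/2, 1/4), the function name and the input values, taken as half the counts 6, 3, 7, 7 of Example 5, are assumptions made for illustration only.

```python
def smooth_support_values(p, c):
    """Smoothed values: sum over j = -r..r of c_j * p_{i+j}, indices modulo n.
    `c` lists the weights c_{-r}, ..., c_0, ..., c_r and should sum to one."""
    n = len(p)
    r = (len(c) - 1) // 2
    return [sum(c[j + r] * p[(i + j) % n] for j in range(-r, r + 1))
            for i in range(n)]

# Hypothetical input: p_i = eta_i / 2 for the counts of Example 5.
p = [3.0, 1.5, 3.5, 3.5]
weights = [0.25, 0.5, 0.25]        # illustrative symmetric weights, r = 1
p_smoothed = smooth_support_values(p, weights)
```

Because the weights sum to one, the total of the support values is preserved while isolated dips such as the second value are pulled towards their neighbours.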

EXAMPLE 9 Again for the data from Example 5 we use the modified Steiner
compact estimator with $r = 1$ and weights $(c_{-1}, c_0, c_1)$ as in (2.17). In Fig. 5 (left) the
Steiner compact estimated from the smoothed rose of intersections is drawn;
the estimator of the rose of directions in Fig. 5 (right) corresponds better to the
data at first sight.

Figure 5. A Steiner compact (left) and the estimated rose of directions (right) for data from
Example 5 using the modified method with smoothing described in Example 9.



There is a theorem in [Rataj & Saxl, 1989] concerning the properties of the 
modified Steiner compact estimator. 

THEOREM 10 Let e > and a € (0, 1). Then there is apian of experiment, 
i.e. integers n,r;ui, . . .u n € S 1 and cj as in (2.17), such that for a planar 
fibre system and pi, K n in (2.18) we have probability 

P(H(K n ,K)<£L A )>a 

under the condition that pi— pi, i = 1, . . . n is a family of independent, centred 
normally distributed random variables with variances bounded by a constant 
a 2 > 0. 

The normality assumption seems to be quite appropriate when using indepen- 
dent test lines, which can be achieved when independent realizations of a fibre 
process are available. 
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3.2.2 Poisson line process 

Any straight line $l(x)$ in the plane can be represented by a point $x = (v, y)$ in the parametric space $C_1 = (0, \pi] \times (-\infty, \infty)$. Here $v$ is the orientation of the line and $y$ its signed distance from the origin: $y$ is positive (negative) for lines intersecting the positive (negative) horizontal semi-axis in $\mathbb{R}^2$, and if $v = \pi$, $y$ is positive for lines in the upper half plane. We can thus represent a stationary line process $\Phi$ by means of a point process $\Psi$ on $C_1$, such that the intensity measure $\Lambda$ of the process $\Psi$ is (see [Stoyan, Kendall & Mecke, 1995])

$$\Lambda(d(v, y)) = L_A \, dy \, \mathcal{R}(dv). \tag{2.19}$$

If the stationary line process $\Phi$ is Poisson, then the point process $\Psi$ is Poisson and stationary with respect to the $y$-coordinate. Conversely, a random point process on $C_1$ that is stationary in the $y$-coordinate defines a stationary line process in $\mathbb{R}^2$.

We will investigate the intersections of a line process with test segments of constant length $l$ and of varying orientations. Consider the unit semicircle $x = \cos\beta$, $y = \sin\beta$, $\beta \in [-\frac{\pi}{2}, \frac{\pi}{2}]$. Denote $\alpha_n = \frac{\pi}{2n}$ and define the test system $T$ of $n$ segments $s_j$ inscribed in the semicircle, see Fig. 6a. The segments have centres $(x_j, y_j)$, $x_j = \cos\beta_j \cos\alpha_n$, $y_j = \sin\beta_j \cos\alpha_n$, and normal orientations $\beta_j = (2j - n - 1)\alpha_n$, $j = 1, \dots, n$. The segments have equal lengths $l = 2\sin\alpha_n$. The total length of $T$, $2n\sin\alpha_n$, converges to $\pi$ as $n \to \infty$. Any straight line in the plane has at most two intersections with the test system $T$. Denote by $A_i$, $A_{ij}$ the subsets of $C_1$ corresponding to lines which intersect exactly one segment ($s_i$) or exactly two segments ($s_i$ and $s_j$), respectively. In Fig. 6b these subsets are drawn for the case $n = 3$.
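The geometry of the test system $T$ is easy to reproduce. The following sketch builds the $n$ chords from $\alpha_n = \pi/(2n)$ and $\beta_j = (2j - n - 1)\alpha_n$; the function name and the returned layout are illustrative choices, not taken from the source.

```python
import math


def build_test_system(n):
    """Segments of the test system T inscribed in the unit semicircle.

    Returns (segments, length) where each segment is
    (centre, normal_angle, (endpoint_a, endpoint_b)).
    """
    alpha = math.pi / (2 * n)          # alpha_n
    length = 2 * math.sin(alpha)       # common segment length l = 2 sin(alpha_n)
    segments = []
    for j in range(1, n + 1):
        beta = (2 * j - n - 1) * alpha                 # normal orientation beta_j
        centre = (math.cos(beta) * math.cos(alpha),    # x_j
                  math.sin(beta) * math.cos(alpha))    # y_j
        # Endpoints lie on the unit circle at angles beta -/+ alpha.
        a = (math.cos(beta - alpha), math.sin(beta - alpha))
        b = (math.cos(beta + alpha), math.sin(beta + alpha))
        segments.append((centre, beta, (a, b)))
    return segments, length


segments, length = build_test_system(8)
total_length = len(segments) * length   # 2 n sin(alpha_n), below pi for finite n
```

For $n = 8$ this gives $\alpha_n \approx 0.196$, and the total length approaches $\pi$ from below as $n$ grows.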




Figure 6. The test system $T$ for $n = 3$ (a), the corresponding subsets $A_i$, $A_{ij}$, $i, j = 1, \dots, n$, $i < j$ (b).



Consider a stationary Poisson line process $\Phi$ with intensity $L_A$ and a rose of directions $\mathcal{R}$. Denote by $N_i$, $N_{ij}$ the independent Poisson distributed random variables with parameters $\lambda_i$, $\lambda_{ij}$, respectively, corresponding to the numbers of lines of $\Phi$ intersecting only the $i$-th segment, and both the $i$-th and the $j$-th segment, respectively. It holds

$$\lambda_{ij} = L_A \int_{A_{ij}} dy \, \mathcal{R}(dv), \qquad \lambda_i = L_A \int_{A_i} dy \, \mathcal{R}(dv).$$

From a realization of the process $ we get estimators of support function 
values 



Observe that 



cov(pi,pj) = -^varNij, i ^ j. 



EXAMPLE 11 The aim is to obtain the probability distribution of the Prohorov distance between the estimator $\mathcal{R}_n$ in (2.16) and a theoretical $\mathcal{R}$. For a stationary Poisson line process and the special test system in Fig. 6 this can be achieved by just simulating the data $N_i$, $N_{ij}$ from the Poisson distribution, evaluating the estimators and finally the Prohorov distance. The results from 1000 independent simulations for $\mathcal{R} = \mathcal{U}$ uniform yield approximations of the probability density of the Prohorov distance $r(\mathcal{R}_n, \mathcal{R})$ in Fig. 7 (without smoothing) and Fig. 8 (with smoothing (2.18)).
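The intersection counts used in such a simulation can also be generated geometrically. The sketch below is an illustration under stated assumptions rather than the procedure of Example 11: it samples isotropic Poisson lines directly instead of drawing $N_i$, $N_{ij}$ from their Poisson laws, it parametrises a line by its normal angle $\theta$ and signed distance $p$ (not by the $(v, y)$ convention of the text), and all function names are hypothetical. For $\mathcal{R} = \mathcal{U}$ the mean count per segment should be close to $\frac{2}{\pi} L_A l$.

```python
import math
import random


def chord_segments(n):
    """Endpoints of the n test segments inscribed in the unit semicircle."""
    alpha = math.pi / (2 * n)
    segs = []
    for j in range(1, n + 1):
        beta = (2 * j - n - 1) * alpha
        segs.append(((math.cos(beta - alpha), math.sin(beta - alpha)),
                     (math.cos(beta + alpha), math.sin(beta + alpha))))
    return segs


def sample_isotropic_lines(intensity, radius, rng):
    """Poisson lines hitting the disc of given radius, as (theta, p) pairs.

    Assumed parametrisation: the line is {x : <x, (cos theta, sin theta)> = p},
    with theta uniform on [0, pi) and p a Poisson process of rate `intensity`
    on [-radius, radius].
    """
    lines = []
    p = -radius + rng.expovariate(intensity)   # exponential gaps give a Poisson process
    while p <= radius:
        lines.append((rng.uniform(0.0, math.pi), p))
        p += rng.expovariate(intensity)
    return lines


def hits(line, seg):
    """True if the line crosses the segment (endpoints on opposite sides)."""
    theta, p = line
    nx, ny = math.cos(theta), math.sin(theta)
    (ax, ay), (bx, by) = seg
    return (ax * nx + ay * ny - p) * (bx * nx + by * ny - p) <= 0.0


def mean_counts(n, intensity, runs, seed=0):
    """Average number of lines hitting each segment over `runs` realisations."""
    rng = random.Random(seed)
    segs = chord_segments(n)
    totals = [0] * n
    for _ in range(runs):
        for line in sample_isotropic_lines(intensity, 1.0, rng):
            for i, seg in enumerate(segs):
                totals[i] += hits(line, seg)
    return [t / runs for t in totals]


counts = mean_counts(n=3, intensity=50.0, runs=200)
expected = 2.0 * 50.0 * 2.0 * math.sin(math.pi / 6) / math.pi   # (2/pi) * L_A * l
```

Since every segment lies inside the unit disc, only lines with $|p| \le 1$ need to be generated.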




Figure 7. Estimated probability density functions of the Prohorov distance $r(\mathcal{R}_n, \mathcal{U})$, $n = 8$, $\alpha_n = 0.19$, for $L_A = 50$ (a), $L_A = 1000$ (b).



3.2.3 Theoretical properties of the Prohorov distance 
distribution 

If the distance between a discrete and a continuous distribution is measured, we observe that the distribution of the Prohorov distance (cf. Figs. 7, 8) is not concentrated near zero. Among the discrete distributions $\mathcal{R}_n \in \mathcal{P}$ with a support $T$ of cardinality at most $n$, the uniform discrete distribution $\mathcal{U}_n$ (with exactly $n$ equidistant atoms) is the nearest to $\mathcal{U}$ in the sense of the Prohorov distance. It holds $r(\mathcal{U}_n, \mathcal{U}) = \frac{\pi}{2n + \pi}$, since the worst case in (2.8) is

$$1 = \mathcal{U}_n(T) \le \mathcal{U}(T^\varepsilon) + \varepsilon = \frac{2n\varepsilon}{\pi} + \varepsilon.$$
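The value $r(\mathcal{U}_n, \mathcal{U}) = \pi/(2n + \pi)$ is the smallest $\varepsilon$ satisfying $1 \le \frac{2n\varepsilon}{\pi} + \varepsilon$. A small numerical check of this worst-case inequality is sketched below; it only verifies that single inequality (not the full Prohorov distance), and the function names are hypothetical.

```python
import math


def prohorov_uniform_atoms(n):
    """Closed form pi / (2n + pi) for r(U_n, U) on orientations in (0, pi]."""
    return math.pi / (2 * n + math.pi)


def smallest_eps(n, tol=1e-12):
    """Bisection for the smallest eps with 1 <= U(T^eps) + eps.

    T is the set of n equidistant atoms; its eps-neighbourhood has uniform
    measure min(1, 2 n eps / pi).
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if min(1.0, 2 * n * mid / math.pi) + mid >= 1.0:
            hi = mid
        else:
            lo = mid
    return hi


value_n8 = prohorov_uniform_atoms(8)   # about 0.164
```

For $n = 8$, the setting of Figs. 7 and 8, the value is about 0.164, which illustrates why the simulated distances cannot concentrate near zero.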
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Figure 8. The same case as in Fig.7 after smoothing with r = 2, Cj = j^ry , j = — r, ..., r. 



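A larger lower bound can be obtained under a supplementary condition [Benes & Gokhale, 2000]:

PROPOSITION 1 For the test system $T$, an isotropic fibre process and the Steiner compact estimator $\mathcal{R}_n$ of $\mathcal{R} = \mathcal{U}$ it holds that the Prohorov distance

$$r(\mathcal{R}_n, \mathcal{U}) \ge \frac{4\alpha_n}{\pi + 2}$$

under the condition $A = [h_i = 0 \text{ for some } i]$.

Proof: Let $i$ be the index which satisfies $A$ and assume that $r(\mathcal{R}_n, \mathcal{U}) < \frac{4\alpha_n}{\pi + 2}$. Then there is a $\delta > 0$ such that $r(\mathcal{R}_n, \mathcal{U}) = \frac{4\alpha_n}{\pi + 2} - \delta$. We use an equivalent definition of the Prohorov distance

$$r(\mathcal{R}_n, \mathcal{U}) = \inf\{\varepsilon > 0;\ \mathcal{U}(C) \le \mathcal{R}_n(C^\varepsilon) + \varepsilon,\ C \text{ closed}\}.$$

Put

$$C = \left[a - \frac{2\pi\alpha_n}{\pi + 2},\ a + \frac{2\pi\alpha_n}{\pi + 2}\right],$$

where $a$ is the orientation at which the atom of $\mathcal{R}_n$ is missing because $h_i = 0$. Then $\mathcal{U}(C) = \frac{4\alpha_n}{\pi + 2}$, and for $\varepsilon = \frac{4\alpha_n}{\pi + 2} - \delta$ we have $C^\varepsilon = [a - 2\alpha_n + \delta,\ a + 2\alpha_n - \delta]$ and $\mathcal{R}_n(C^\varepsilon) = 0$. Altogether $\mathcal{R}_n(C^\varepsilon) + \varepsilon = \frac{4\alpha_n}{\pi + 2} - \delta < \mathcal{U}(C)$, which leads to a contradiction. □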

A lower bound for $\Pr(A)$ is $\sum_i \Pr(B_i) - \sum_{i<j} \Pr(B_i \cap B_j)$, where the event $B_i = [\hat p_{i-1} + \hat p_{i+1} - 2\hat p_i \cos 2\alpha_n \le 0]$.
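The event $B_i$ says that the $i$-th edge of the polygon degenerates. A minimal sketch of this check follows, using the elementary formula for the edge length of a polygon given by support values with equally spaced normals (angle $\theta = 2\alpha_n$ between neighbours). Whether this coincides literally with (2.15) is an assumption, as are the cyclic index convention and the function names.

```python
import math


def edge_lengths(p, theta):
    """Signed edge lengths h_i = (p[i-1] + p[i+1] - 2 p[i] cos(theta)) / sin(theta).

    Assumption: indices wrap cyclically, which matches a centrally symmetric
    polygon whose n support values are given on a half-circle of normals.
    A non-positive value means the i-th constraint does not produce an edge.
    """
    n = len(p)
    return [
        (p[i - 1] + p[(i + 1) % n] - 2.0 * p[i] * math.cos(theta)) / math.sin(theta)
        for i in range(n)
    ]


def vanished_edges(p, theta):
    """Indices i for which the event B_i occurs (non-positive signed length)."""
    return [i for i, h in enumerate(edge_lengths(p, theta)) if h <= 0.0]


theta = math.pi / 8                          # 2 * alpha_n for n = 8
regular = edge_lengths([1.0] * 8, theta)     # polygon circumscribed about the unit circle
missing = vanished_edges([1.0, 1.0, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0], theta)
```

When some edge vanishes, the lengths of the neighbouring edges have to be recomputed from the remaining constraints; the sketch only detects the event.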

3.2.4 Simulation study 

In this section, the Steiner compact estimation procedure is investigated for more complex models of fibre systems, which requires the simulation of a realization together with a chosen test system.

Figure 9. Realizations of tessellations with the approximate intensity $L_A = 22$: (a) a Poisson line process, (b) a 2D Poisson-Voronoi tessellation, (c) a planar section of a 3D Poisson-Voronoi tessellation (not used in simulations), (d) a planar section of a 3D Johnson-Mehl tessellation.



The distribution of the Prohorov distance, given the uniform rose of directions, the test system in Fig. 6 and the estimator (2.13), was evaluated for three models in the plane, see Fig. 9: the Poisson line process, the Poisson-Voronoi tessellation [Stoyan, Kendall & Mecke, 1995] and the planar section of the three-dimensional Johnson-Mehl tessellation model [Ohser & Mücklich, 2000]. Using the algorithm for the Prohorov distance estimation and 1000 repeated simulations, the distribution of the Prohorov distance shown in Fig. 10 is obtained. It follows that for more regular fibre processes (formed by tessellations) the estimator is more precise.

Figure 10. Estimated probability density functions of the Prohorov distance $r(\mathcal{R}_n, \mathcal{U})$, $n = 12$, $\alpha_n = 0.131$, for $L_A = 100$. The result for the Poisson line process is marked by the gray dotted line, for the Poisson-Voronoi tessellation by the solid line, and for the Johnson-Mehl tessellation by the dashed line.



3.2.5 Curved test systems 

We shall investigate the role of curved test systems in the estimation of the rose of directions of a planar fibre process following [Benes & Gokhale, 2000]. Consider a test system $T'$ of arcs with finite total length $l$ and $T'(B)$, $B \in \mathcal{B}^2$, the corresponding length measure of $T'$ in $B$. Assume that almost surely (w.r.t. the length measure) the tangent orientation $w(x)$ of $T'$ at $x$ is defined. Then the orientation distribution $Q$ of $T'$ on $S^1$ is given by

$$\int f(\alpha) \, Q(d\alpha) = \frac{1}{l} \int f(w(x)) \, T'(dx),$$

valid for any $f \ge 0$ measurable on $S^1$. Denote also by $T'(u)$ the rotation of $T' = T'(0)$ by an angle $u \in S^1$ with respect to the $x$-axis.

[Mecke, 1981] points out that if the test system is formed by curved lines with tangent orientation distribution $Q \in \mathcal{P}$, then

$$P_L^Q(u) = L_A \, G_{\mathcal{R} * Q_-}(u), \tag{2.20}$$

where $P_L^Q(u)$ is the rose of intersections of $\Phi \cap T'(u)$. Further, $Q_-$ is the reflection of $Q$, i.e. $\int f(u) \, Q_-(du) = \int f(\pi - u) \, Q(du)$ for any non-negative measurable function $f$ on $S^1$, and $\mathcal{R} * Q_-$ is the convolution of measures defined by $\int f(x) \, \mathcal{R} * Q_-(dx) = \iint f(x + y) \, \mathcal{R}(dx) \, Q_-(dy)$. In particular, for $Q = \mathcal{U}$ uniform it follows from (2.20) that $P_L^{\mathcal{U}}(u) = \frac{2}{\pi} L_A$, $u \in S^1$, is a constant, denoted $P_L^{\mathcal{U}}(u) = P_L$.

Generally, comparing (1.7) and (2.20) we see that if there is a statistical method for estimating $\mathcal{R}$ from (1.7), the same method estimates $\mathcal{R} * Q_-$ from (2.20) when using a curved test system. Unfortunately, the system $\mathcal{P}$ with the convolution operation does not possess a natural inverse element to solve the equation $\mathcal{R} * Q = Q_1$ for an unknown $\mathcal{R}$, cf. [Heyer, 1977].

Elements $\delta_u \in \mathcal{P}$, $u \in S^1$, provide the rotation $Q(u) = Q * \delta_u$ of a given measure $Q \in \mathcal{P}$. The effect of the convolution operation of measures on Steiner compact sets may be observed most easily when both measures are discrete: $\mathcal{R} = \sum_{i=1}^{n} a_i \delta_{u_i}$, $Q = \sum_{j=1}^{m} b_j \delta_{v_j}$, $\sum a_i = \sum b_j = 1$, $a_i, b_j > 0$, $u_i, v_j \in S^1$. Then the convolution $\mathcal{R} * Q$ is again a measure with finite support $\{u = u_i + v_j;\ i = 1, \dots, n,\ j = 1, \dots, m\}$. The atom in $u_i + v_j$ has size $a_i b_j$. Now the Steiner compact associated with a discrete measure has the form $Z = \sum_{i,j} [-c_{ij}, c_{ij}]$, cf. (2.2), where $c_{ij}$ are vectors in $\mathbb{R}^2$ with orientations $u_i + v_j$ and lengths $a_i b_j$.
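For discrete measures the convolution reduces to pairing atoms: an atom of weight $a_i b_j$ is placed at $u_i + v_j$. The sketch below represents a measure as a list of (angle, weight) pairs; the representation, the function name and the reduction of angles modulo $\pi$ are assumptions made for illustration.

```python
import math


def convolve_discrete(atoms_r, atoms_q, period=math.pi):
    """Convolution of two discrete measures given as lists of (angle, weight).

    The result has an atom of weight a_i * b_j at u_i + v_j. Reducing angles
    modulo `period` is an assumption made here for orientation data.
    """
    return [((u + v) % period, a * b)
            for (u, a) in atoms_r
            for (v, b) in atoms_q]


rose = [(0.1, 0.5), (1.0, 0.5)]            # hypothetical discrete rose R
probe = [(0.2, 0.25), (0.4, 0.75)]         # hypothetical discrete Q
conv = convolve_discrete(rose, probe)      # n * m = 4 atoms, weights sum to 1
rotated = convolve_discrete(rose, [(0.3, 1.0)])   # Q * delta_u acts as a rotation by u
```

Convolving with a single unit atom $\delta_u$ simply shifts every atom by $u$, which is the rotation mentioned in the text.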

The following result comes from [Hilliard, 1962], [Mecke, 1981].

PROPOSITION 2 For the Fourier images $\hat{\mathcal{R}}(k)$, $\hat Q(k)$ defined by (1.14) and for $\hat P_L^Q(k) = \int_0^\pi P_L^Q(u) e^{2iku} \, du$ it holds

$$\hat{\mathcal{R}}(k) \, \hat Q(-k) = \frac{1}{2L_A} (1 - 4k^2) \, \hat P_L^Q(k), \quad k = \dots, -1, 0, 1, \dots \tag{2.21}$$

Proof: Let $f$ be a $\pi$-periodic twice continuously differentiable function. Then $\int_0^\pi f(u) \, \mathcal{R}(du) = \frac{1}{2} \int_0^\pi G_{\mathcal{R}}(u) [f(u) + f''(u)] \, du$, using two-fold integration by parts. Putting $f(u) = e^{2iku}$ we get formula (1.15). Applying the same idea to $\mathcal{R} * Q_-$ and using the fact that the Fourier transform of a convolution is a product of Fourier transforms we get (2.21). □

Further we observe that the local smoothing in (2.18) can be expressed in terms of the convolution with a discrete measure $Q$ representing the orientation distribution of a test system.

PROPOSITION 3 Let $Q = \sum_{i=1}^{m} b_i \delta_{v_i}$, $b_i > 0$, $\sum b_i = 1$, $v_i \in S^1$, $i = 1, \dots, m$. Then

$$P_L^Q(u) = \sum_{i=1}^{m} b_i P_L(u - \pi + v_i), \quad u \in S^1.$$

Proof: We have $Q_- = \sum_i b_i \delta_{\pi - v_i}$ and

$$G_{\mathcal{R} * Q_-}(w) = \int_0^\pi |\sin(u - w)| \, \mathcal{R} * Q_-(du) = \sum_{i=1}^{m} b_i \int_0^\pi |\sin(u + \pi - v_i - w)| \, \mathcal{R}(du) = \sum_{i=1}^{m} b_i G_{\mathcal{R}}(w - \pi + v_i).$$

Then

$$P_L^Q(w) = L_A G_{\mathcal{R} * Q_-}(w) = L_A \sum_{i=1}^{m} b_i G_{\mathcal{R}}(w - \pi + v_i) = \sum_{i=1}^{m} b_i P_L(w - \pi + v_i).$$

□

Naturally it is not necessary to restrict to atomic measures Q for local smooth- 
ing; diffuse measures correspond to curved test systems. 
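The identity of Proposition 3, $G_{\mathcal{R} * Q_-}(w) = \sum_i b_i G_{\mathcal{R}}(w - \pi + v_i)$ with $G_{\mathcal{R}}(w) = \int |\sin(u - w)| \, \mathcal{R}(du)$, can be checked numerically for atomic measures. The following sketch takes $L_A = 1$ and represents measures as lists of (angle, weight) pairs; the function names and test data are hypothetical.

```python
import math


def G(atoms, w):
    """G_R(w) = sum_k a_k |sin(u_k - w)| for a discrete measure R."""
    return sum(a * abs(math.sin(u - w)) for (u, a) in atoms)


def G_conv_reflected(atoms_r, atoms_q, w):
    """G of the convolution R * Q_, where Q_ has atoms at pi - v_i."""
    return sum(a * b * abs(math.sin(u + math.pi - v - w))
               for (u, a) in atoms_r
               for (v, b) in atoms_q)


def smoothed(atoms_r, atoms_q, w):
    """Right-hand side of Proposition 3 with L_A = 1."""
    return sum(b * G(atoms_r, w - math.pi + v) for (v, b) in atoms_q)


rose = [(0.3, 0.2), (1.1, 0.5), (2.4, 0.3)]      # hypothetical R
probe = [(0.0, 0.25), (0.1, 0.5), (0.2, 0.25)]   # hypothetical Q
gap = max(abs(G_conv_reflected(rose, probe, w) - smoothed(rose, probe, w))
          for w in [k * 0.05 for k in range(63)])
```

The two sides agree up to rounding error, so using a test system with discrete orientation distribution $Q$ acts as a weighted moving average of the rose of intersections.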

EXAMPLE 12 Let $\mathcal{R} = \delta_0$ and let $Q_-$ have the probability density $q(w) = \frac{1}{a}$ for $w \in [0, a)$ and $q(w) = 0$ elsewhere, for some $a$, $0 < a < \frac{\pi}{2}$. Then $G_{\mathcal{R}}(w) = \sin w$ and $G_{\mathcal{R} * Q_-}(w) = \frac{\cos w - \cos(w + a)}{a}$, $w \in [0, \pi - a)$, with an apparent smoothing effect for small $a > 0$.

It is concluded that curved test systems present an alternative to local smoothing in (2.18) when estimating the Steiner compact. It should be kept in mind that using the rose of intersections $P_L^Q(u)$ (i.e. using local smoothing) we get estimators of $\mathcal{R} * Q_-$, which is not exactly $\mathcal{R}$. In $\mathbb{R}^3$, the convolution operation does not exist in a simple form because of the complexity of the space of rotations on $S^2$.

3.2.6 Steiner compact in $\mathbb{R}^d$ and in $\mathbb{R}^3$

The complications in approximating the zonoid associated with the rose of directions $\mathcal{R}$ in $\mathbb{R}^d$, $d \ge 3$, are consequences of the special nature of zonotopes and zonoids. Thus the intersection of supporting halfspaces (2.9) produces a centrally symmetric polytope, but it is not a zonotope in general because its two-dimensional faces need not be centrally symmetric. Also the interpolation and smoothing procedures do not produce zonoids but only generalized zonoids. These are centrally symmetric, but their even generating measures are not non-negative as required but only signed measures [Schneider, 1993]. Consequently, the inversion of the integral equation (1.11) proposed in [Hilliard, 1962], [Kanatani, 1984] need not give a non-negative estimator of the rose of directions $\mathcal{R}$, as pointed out by [Goodey & Weil, 1993].

More correct solutions are based on Theorem 6, as shown in [Kiderlen, 2001]. The basic idea is an approximation of the generating measure $\mu \in \mathcal{M}$ by a measure concentrated on a finite support

$$T_m = \{v_1, \dots, v_m, -v_1, \dots, -v_m\} \subset S^{d-1} \tag{2.22}$$

such that $Z_m = \sum_{i=1}^{m} \alpha_i [v_i, -v_i]$ is a zonotope estimating the zonoid $Z$ corresponding to $\mu$. The problem is a suitable choice of $T_m$ and of the weights $\alpha_i$ such that $Z_m \to Z$.

Let $\Phi$ be a stationary fibre process in $\mathbb{R}^d$ with intensity $\lambda$ (specially in $\mathbb{R}^3$ we denote $\lambda$ by $L_V$) and the rose of directions $\mathcal{R}$. Consider $k$ fixed test hyperplanes $u_i^\perp$ with normals $u_i \in S^{d-1}$, $i = 1, \dots, k$, such that they do not contain a common line. Denote by $\eta_i = \#(\Phi \cap W_i)$ the number of intersection points counted in $\Phi \cap W_i$, where $W_i \subset u_i^\perp$ are observation windows of unit $(d-1)$-dimensional area in the test hyperplanes. The set of all $\eta_i$ then constitutes a random vector $\eta(\mu) = \{\eta_1, \dots, \eta_k\}$ with the mean value $\mathsf{E}\eta(\mu) = \{\mathsf{E}\eta_1, \dots, \mathsf{E}\eta_k\}$, where in $\mathbb{R}^3$ the value $\mathsf{E}\eta_i$ is the rose of intersections evaluated at $u_i$, $i = 1, \dots, k$. In contrast to the test system $T$ in the planar case, we assume here that the $\eta_i$ are independent, which can be ensured by examining independent realizations of $\Phi$ for different planes $u_i^\perp$. This assumption is violated in the next section, where curved or polytopal probes are used for the investigation of a single realization.

The idea of a maximum likelihood (ML) estimator of the measure $\mu$ was formulated in [Mair, Rao & Anderson, 1996] and is further developed in [Kiderlen, 2001]. Assume that the fibre process $\Phi$ is a stationary Poisson line process, so that the $\eta_i$ are Poisson distributed. Further assume that the observed realization $\hat\eta$ of $\eta(\mu)$ is a non-zero vector. The ML estimator $\hat\mu$ maximizes the log-likelihood function $L(\mu) : \mu \mapsto \log P(\eta(\mu) = \hat\eta)$, i.e., up to an additive constant not depending on $\mu$,

$$L(\mu) = \sum_{i=1}^{k} \big(\hat\eta_i \log(\mathsf{E}\eta_i(\mu)) - \mathsf{E}\eta_i(\mu)\big). \tag{2.23}$$

The convex optimization problem

(i) to minimize $-L(\mu)$ with respect to $\mu \in \mathcal{M}$

is shown to have a solution in [Mair, Rao & Anderson, 1996]. It is not unique, but any two solutions $\hat\mu_1$, $\hat\mu_2$ are tomographically equivalent, i.e. they satisfy

$$\mathsf{E}\eta_i(\hat\mu_1) = \mathsf{E}\eta_i(\hat\mu_2)$$

for all $i = 1, \dots, k$. For large $k$ and regularly distributed $u_i$ on $S^{d-1}$, the Prohorov distance of tomographically equivalent measures is small.

To solve the problem (i), numerical methods must be used, searching for a solution in the finite-dimensional subcone $\mathcal{M}(T_m) \subset \mathcal{M}$ of measures with support in $T_m$. Then the optimization problem (i) reduces to

(ii) to minimize $-L(\mu)$ with respect to $\mu \in \mathcal{M}(T_m)$.

There is a choice of $T_m$ which is optimal in the sense of the following theorem. We will specify this just for $d = 3$; for the general formulation see [Kiderlen, 2001], where the theorem is proved under the assumption that $\Phi$ is the Poisson line process and, consequently, $\eta(\mu)$ is multivariate Poisson distributed.

THEOREM 13 Under the above assumptions concerning the choice of test planes and $\hat\eta$, the problem (ii) has a solution. If $T_m$ is the set of all unit vectors orthogonal to all linearly independent pairs in $\{u_1, \dots, u_k\}$, then any solution of (ii) is a solution of (i).
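For $d = 3$ the support $T_m$ of Theorem 13 can be generated from normalized cross products of pairs of test normals. The sketch below is an illustration under assumptions: it returns one representative of each pair $\{v, -v\}$, merges directions that coincide up to sign, and uses hypothetical function names.

```python
import math
from itertools import combinations


def cross(a, b):
    """Cross product of two 3D vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def candidate_support(normals, tol=1e-12):
    """Unit vectors orthogonal to each linearly independent pair of normals.

    Only one representative of each pair {v, -v} is returned, and directions
    that coincide up to sign are merged (an implementation choice made here).
    """
    support = []
    for a, b in combinations(normals, 2):
        c = cross(a, b)
        norm = math.sqrt(sum(x * x for x in c))
        if norm < tol:          # parallel pair: not linearly independent
            continue
        v = tuple(x / norm for x in c)
        # Skip v if it (or its negative) is already present.
        if any(abs(abs(sum(x * y for x, y in zip(v, w))) - 1.0) < 1e-9 for w in support):
            continue
        support.append(v)
    return support


cube_normals = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]                       # k = 3
octa_normals = [(1, 1, 1), (1, 1, -1), (1, -1, 1), (-1, 1, 1)]         # k = 4
cube_support = candidate_support(cube_normals)
octa_support = candidate_support(octa_normals)
```

The face normals of a cube ($k = 3$) and of an octahedron ($k = 4$) yield 3 and 6 directions respectively, in agreement with the bound $k(k-1)/2$.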

Clearly $m \le k(k - 1)/2$ for $d = 3$. Denote by $\hat{\mathcal{R}}_{m,k}$ the ML estimator of the rose of directions based on $k$ test orientations and $T_m$ as introduced in Theorem 13: $\hat\mu = \lambda \hat{\mathcal{R}}_{m,k}$. It can be shown that Theorem 13 holds for general stationary fibre processes, too. The estimator need not be a maximum likelihood estimator then (the Poisson property of $\eta_i$ may fail), but it is consistent in the following sense [Kiderlen, 2001]. An asymptotically smooth sequence $\{u_1, u_2, \dots\} \subset S^2$ is such that the sequence of measures $\tau_k = \frac{1}{k} \sum_{i=1}^{k} \delta_{u_i}$ converges weakly in $\mathcal{M}$ and the limit has a positive density.

THEOREM 14 Let $\Phi$ be a stationary fibre process in $\mathbb{R}^3$ with $\mu = L_V \mathcal{R}$ which is not supported by any great circle in $S^2$, and let $\{u_1, u_2, \dots\}$ be an asymptotically smooth sequence in $S^2$. Let $\eta_1, \dots, \eta_k$ be non-correlated intersection counts in unit windows in $\{u_1^\perp, \dots, u_k^\perp\}$, respectively, and let there exist a constant $c \in \mathbb{R}$ such that $\mathsf{E}(\#(\Phi \cap B^{d-1})^2) < c$ for all unit $(d-1)$-dimensional balls $B^{d-1}$.

Then $\mathcal{R}$ is estimated consistently by the ML estimator in the strong sense, i.e. we have

$$\lim_{k \to \infty} \hat{\mathcal{R}}_{m,k} = \mathcal{R}$$

almost surely.



For the numerical solution $\hat\mu$ of problem (ii) the EM algorithm is proposed in [Kiderlen, 2001].
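To give an idea of what such an iteration looks like, the sketch below applies the standard multiplicative EM update for Poisson linear models to the weights $\alpha_j$ on $T_m$, with the mean counts modelled as $\mathsf{E}\eta_i = \sum_j \alpha_j |\langle v_j, u_i \rangle|$. This is a generic textbook scheme shown under that modelling assumption; it is not claimed to be the exact variant used in [Kiderlen, 2001], and all names are hypothetical.

```python
import math


def kernel(normals, support):
    """K[i][j] = |<v_j, u_i>| for unit vectors u_i (normals) and v_j (support)."""
    return [[abs(sum(a * b for a, b in zip(u, v))) for v in support] for u in normals]


def log_likelihood(K, counts, alpha):
    """L = sum_i (eta_i log(E eta_i) - E eta_i), cf. (2.23), constants dropped."""
    total = 0.0
    for row, eta in zip(K, counts):
        mean = sum(k * a for k, a in zip(row, alpha))
        total += (eta * math.log(mean) if eta > 0 else 0.0) - mean
    return total


def em_step(K, counts, alpha):
    """One multiplicative EM update for a Poisson linear model."""
    means = [sum(k * a for k, a in zip(row, alpha)) for row in K]
    new_alpha = []
    for j, a in enumerate(alpha):
        col_sum = sum(row[j] for row in K)
        ratio = sum(row[j] * eta / m for row, eta, m in zip(K, counts, means))
        new_alpha.append(a * ratio / col_sum)
    return new_alpha


def fit(K, counts, iterations=200):
    """Run the EM iteration from a strictly positive starting point."""
    alpha = [1.0] * len(K[0])
    for _ in range(iterations):
        alpha = em_step(K, counts, alpha)
    return alpha
```

Each step keeps the weights non-negative and does not decrease the log-likelihood, which is why no explicit constraint handling is needed.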

The second approach to the estimation of $\mathcal{R}$ [Kiderlen, 2001] is based on an idea of [Campi, Haas & Weil, 1994] and generalizes the 2D approach based on (2.9). Theorem 6 implies the possibility of approximating zonoids by zonotopes in fixed directions $u_1, \dots, u_k$. Next we are looking for a zonotope $Z$ which is contained in a polytope

$$Q_k = \bigcap_{i=1}^{k} \{x \in \mathbb{R}^d;\ \langle x, u_i \rangle \le h(Z, u_i)\}. \tag{2.24}$$

$Q_k$ need not be a zonotope in dimension $d \ge 3$. Theorem 13 suggests the choice of $T_m$, which should contain the set of orientations of the line segments forming the zonotope $Z$. Then only the lengths of its line segments have to be determined. Using $Z_m = \sum_{j=1}^{m} \alpha_j [-v_j, v_j]$ we get a linear program

$$\text{minimize: } \sum_{i=1}^{k} \Big(h(Z, u_i) - \sum_{j=1}^{m} \alpha_j |\langle v_j, u_i \rangle|\Big),$$

$$\text{subject to: } \sum_{j=1}^{m} \alpha_j |\langle v_j, u_i \rangle| \le h(Z, u_i), \quad i = 1, \dots, k, \qquad \alpha_j \ge 0, \quad j = 1, \dots, m.$$

It can be derived from Theorem 13 that there exists a solution of this linear program with objective function value 0, which yields the desired zonotope and, by optimization theory, at most $k$ of the $\alpha_j$ are positive. However, the substitution of $\hat\eta_i$ for $h(Z, u_i)$ is dangerous in this case, because values of $\hat\eta_i$ substantially lower than $\mathsf{E}\eta_i$ (their presence cannot be excluded) can produce an estimate $\hat{\mathcal{R}}_m = 0$ with a positive probability. Consequently, it is recommended to replace $\hat\eta_i$ by their arithmetic averages obtained by independent replicated sampling. Applying a numerical optimization procedure to the linear program (LP), the estimator of the rose of directions is obtained, and a consistency theorem analogous to Theorem 14 can be formulated, see [Kiderlen, 2001], where both estimators (EM and LP) are also compared. It is concluded that for a smaller sample size the maximum likelihood estimator is slightly better, while for larger sample sizes linear programming should be preferred, because the slightly worse performance of the LP estimator is well compensated by its being less time consuming.

3.2.7 Estimation of 3D fibre anisotropy; computer 
simulation 

A 3D analogy of the arc and polygonal test systems for the anisotropy estimation in $\mathbb{R}^2$ are polyhedral probes. In this subsection, the situation frequently met in practice is examined in detail, namely that only a single realization of the fibre process is available. Then the assumptions of Theorem 14 are not satisfied because of correlated intersection counts $\eta_i$. Three isotropic fibre processes (edges of various Voronoi tessellations) were examined by means of cubic and octahedral probes, and the distribution of the Prohorov distance was estimated in [Hlawiczkova, 2001]. Its variance decreases with the growing number of probe faces (similarly as with the number of random testing planes in [Kiderlen, 2001]) and increases with a growing local inhomogeneity of the process as characterized e.g. by the distribution of the tessellation cell volume (compare with the 2D results in [V. Benes et al, 2001]). For a more detailed study [Hlawiczkova, Ponizil & Saxl, 2001], again the processes of Voronoi cell edges have been selected. They represent a continuous passage from a pronounced anisotropy of linear and planar types to complete isotropy. Besides these processes with diffuse roses of directions, three processes with atomic roses have also been considered theoretically for comparison.

The characteristics of the examined fibre processes are as follows:

i. The monoclinic point lattice $\mathcal{H}_0$ with the lattice vectors $|a_1| = |a_2| = |a_3|/q$, $\langle a_1, a_2 \rangle = 0.5|a_1|^2$ generates the isohedral tiling $\mathcal{T}_0$ by regular hexagonal prisms with four-valent base edges (the relative weights of their three orientations are $1/(3 + 2q)$) of length $b$ and three-valent vertical edges (the relative weight of their orientation is $2q/(3 + 2q)$) of length $qb$. The edge process $\Phi_0(q)$ with atomic measures was examined in three particular cases: $q = 0.2$ (thin plates producing nearly planar anisotropy), $q = 10$ (long rods producing nearly linear anisotropy) and $q = 1$ (intermediate case).

ii. Let $\xi_x$ be i.i.d. random vectors with the Gaussian $N(0, \Sigma^2)$ distribution, $\Sigma^2 = a^2 I$, where $I$ is the unit matrix and $x \in \mathcal{H}_0$ denotes the lattice points. The displaced lattice or the Bookstein model on $\mathcal{H}_0$ [Stoyan & Stoyan, 1994] is $\mathcal{H}_a = \bigcup_{x \in \mathcal{H}_0} (x + \xi_x)$. The tessellation $\mathcal{T}_a$ generated by $\mathcal{H}_a$ is a normal random tessellation with three-valent edges, and several of its characteristics are discontinuous at $a = 0+$ (for details see [Hlawiczkova, Ponizil & Saxl, 2001]). The edge process $\Phi_a(q)$ with a diffuse anisotropy measure was examined for $a = 0.005, 0.2, 0.5, 2$ (in the units of the nearest neighbour distance in $\mathcal{H}_0$) at the values of $q$ chosen above for $\Phi_0(q)$. For high $a$, $\mathcal{T}_a$ approaches the stationary Poisson-Voronoi tessellation and $\Phi_a$ is isotropic for an arbitrary $q$.




Figure 11. Enlarged probes in the $\Omega_0$ and $\Omega$ orientations; the embedding cubes show the mutual orientations of the probes and of the tessellated cube but not its true size.



The tessellations have been constructed in a unit cube by means of the incremental method with the nearest neighbour algorithm [Okabe, Boots & Sugihara, 1977]. The number of process realizations was between 500 and 1000. Centrally symmetric polyhedral probes (icosahedron, octahedron, dodecahedron and cube; only the results for the first two of them are shown in what follows) of the same surface area ($A = 0.8617$) have been placed in the centre of the tessellated unit cube, see Fig. 11. In order to suppress a possible positional bias between the tessellation and the probes, each realization was randomly shifted as a whole with respect to the cube centre by a random vector $\eta$ with the Gaussian $N(0, \Sigma^2)$ distribution, $\Sigma^2 = \tau^2 I$, and the value of $\tau$ was comparable with the lattice constants of $\mathcal{H}_0$.

Figure 12. The true discrete roses of directions $\mathcal{R}$ (orientations and weights) and the corresponding factors $g$ (see below, Eq. (2.25)) for $\Phi_0$ fibre processes (upper row), the calculated estimates $\hat{\mathcal{R}}_{45}$ and $g$ values for the icosahedral probe in the $\Omega$ orientation (lower row) as obtained by the EM algorithm.

Two orientations of the probes were examined, namely $\Omega_0$ (all octahedron diagonals parallel to the coordinate axes, one icosahedral diagonal perpendicular to the $\{x, y\}$-plane and two icosahedral edges parallel with the $x$-axis) and $\Omega$ obtained by rotations from $\Omega_0$ (octahedron rotated by Euler angles $(\phi, \psi, \chi) = (\pi/7, \pi/3, \pi/2)$, icosahedron rotated by $(\pi/7, \pi/9, \pi/2)$), see Fig. 11. Their size and the intensities $\lambda$ of the tessellations were chosen in such a way that the expected total number of intersections per whole probe, $\mathsf{E}N$, was approximately constant ($\mathsf{E}N = 1840$) in all considered cases, and the edge effects were considerably suppressed by confining the examination to the central part of the unit cube.
Figure 13. The discrete roses of directions $\hat{\mathcal{R}}_m$ and $g$ values estimating the fibre processes $\Phi_{0.005}$ at various values of $q$ by means of icosahedral ($\hat{\mathcal{R}}_{45}$, upper row) and octahedral ($\hat{\mathcal{R}}_6$, lower row) probes in the $\Omega$ orientations. Note the considerably weaker performance of the octahedral probe.

The rose of directions $\hat{\mathcal{R}}_m$ approximating the rose $\mathcal{R}$ of the examined fibre process $\Phi$ is estimated by the ML procedure described above, and the weights $\alpha_i$ are found by the iterative EM algorithm.

The atomic measures $\mathcal{R}$ of $\Phi_0$ are shown and compared with the positions and weights of the estimate $\hat{\mathcal{R}}_{45}$ as calculated for the icosahedral probe in Fig. 12 (circle areas are proportional to the weights $\alpha_i$ and their total area is 1% of the projected sphere area). It is clearly seen that the description of the true atomic measures by $\hat{\mathcal{R}}_{45}$ is rather unsatisfactory; discrete measures concentrated in the polar region at $q = 0.2$ and in the equatorial strip at $q = 10$ are clearly underestimated. Moreover, the atomic measures in the equatorial plane at $q = 0.2$ are approximated by a broad layer of many weaker atoms. The result would perhaps be better for another probe orientation.

The estimation is more successful in the case of the $\Phi_{0.005}$ processes (Fig. 13), where the diffuse planar anisotropy is reflected much better, even though the lack of equatorial directions in the estimate by the icosahedral probe in the case of linear anisotropy is again surprising. The estimation improves considerably when $\Phi_a$ approaches isotropy.

The effect of the probe orientation with respect to the fibre orientations may be quite substantial, in particular when the number of probe faces is small. It is shown in [Hlawiczkova, Ponizil & Saxl, 2001] that cubic and octahedral probes in the orientation $\Omega_0$ are completely "blind" to the changes in anisotropy of $\Phi_{0.005}$, and the estimated roses of directions $\hat{\mathcal{R}}_3$ and $\hat{\mathcal{R}}_6$ are identical for $q = 0.2, 1, 10$. Consequently, if there is no preliminary knowledge of the type of the examined anisotropy, a combination of several probe orientations is unavoidable.

Frequently, a simple numerical characteristic of the degree of anisotropy is required in practice. If there is some preliminary knowledge concerning the type of the examined anisotropy, as in the examined case (linear anisotropy in the direction of the $z$-axis), a suitable numerical factor describing the measure arrangement and strength can be the ratio

$$g = \sum_{v_i \in S_e} \alpha_i \Big/ \sum_{v_i \in S_c} \alpha_i, \tag{2.25}$$

where $S_e \subset S^2$ is the equatorial strip of area $2\pi$ and $S_c$ its complement in $S^2$. For $\Phi_0$, $g = 3/(2q)$, hence $g$ is high in the case of a quasi-planar anisotropy and low if the linear anisotropy prevails; $g \approx 1$ when approaching the isotropic case. The estimated values of $g$ for $\Phi_0$ and $\Phi_{0.005}$ are given in Figs. 12 and 13. For details see [Hlawiczkova, Ponizil & Saxl, 2001].
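The factor $g$ is straightforward to evaluate for an atomic rose. In the sketch below the equatorial strip of area $2\pi$ on the unit sphere is taken as $|z| \le 1/2$ (a zone of height 1, hence half of the sphere area), placed symmetrically about the equator; this concrete placement, the azimuths of the base edges and the function names are assumptions made for illustration.

```python
import math


def g_factor(atoms):
    """Ratio of equatorial to polar mass for atoms given as ((x, y, z), weight).

    The equatorial strip S_e is taken as |z| <= 1/2 on the unit sphere, which
    has area 2*pi (half of the sphere); this concrete choice is an assumption.
    """
    equatorial = sum(w for (v, w) in atoms if abs(v[2]) <= 0.5)
    polar = sum(w for (v, w) in atoms if abs(v[2]) > 0.5)
    return equatorial / polar


def prism_edge_atoms(q):
    """Atomic rose of the hexagonal-prism edge process Phi_0(q)."""
    base = [(math.cos(k * math.pi / 3), math.sin(k * math.pi / 3), 0.0) for k in range(3)]
    atoms = [(v, 1.0 / (3 + 2 * q)) for v in base]          # three base-edge orientations
    atoms.append(((0.0, 0.0, 1.0), 2.0 * q / (3 + 2 * q)))  # vertical edges
    return atoms


g_values = [g_factor(prism_edge_atoms(q)) for q in (0.2, 1.0, 10.0)]
```

For $q = 0.2, 1, 10$ this reproduces $g = 3/(2q)$, i.e. 7.5, 1.5 and 0.15.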
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3.2.8 Approach to the isotropy, Prohorov distance 

The Prohorov distance was used as a characteristic of the estimation quality of the rose of directions $\mathcal{R}$ in the above cited papers [V. Benes et al, 2001], [Hlawiczkova, 2001], [Kiderlen, 2001]. In [Hlawiczkova, Ponizil & Saxl, 2001], its estimation serves a different goal, namely to describe the approach to isotropy of the examined $\Phi_a(q)$ with growing standard deviation $a$ of the lattice point shifts. This approach is described by the decrease of the Prohorov distance between $\hat{\mathcal{R}}_6$, as estimated by the octahedral probe, and the uniform rose of directions $\mathcal{U}$ (with density $1/(4\pi)$) with growing $a$. The estimates of the corresponding pdf's of $r(\hat{\mathcal{R}}_m, \mathcal{R})$ are shown in Fig. 14 (the Epanechnikov kernel estimator with the bandwidth $h = 0.02$ was used).
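A kernel density estimate with the Epanechnikov kernel $K(t) = \frac{3}{4}(1 - t^2)$ for $|t| < 1$ can be written in a few lines. The sketch below is a generic implementation with a hypothetical sample of simulated distances; only the kernel type and the bandwidth $h = 0.02$ come from the text.

```python
def epanechnikov_kde(sample, h):
    """Return a function x -> kernel density estimate with bandwidth h."""
    n = len(sample)

    def density(x):
        total = 0.0
        for xi in sample:
            t = (x - xi) / h
            if abs(t) < 1.0:
                total += 0.75 * (1.0 - t * t)   # Epanechnikov kernel
        return total / (n * h)

    return density


distances = [0.21, 0.24, 0.25, 0.27, 0.31]      # hypothetical Prohorov distances
pdf = epanechnikov_kde(distances, h=0.02)
grid = [i * 1e-4 for i in range(10001)]
area = sum(pdf(x) for x in grid) * 1e-4          # Riemann sum over [0, 1]
```

With such a small bandwidth each observation only influences the estimate within 0.02 of its value, so the plotted densities follow the simulated distances closely.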




Figure 14. The probability density functions of the Prohorov distance $r(\hat{\mathcal{R}}_6, \mathcal{U})$ for $\Phi_a$ as determined by the octahedral probe.



The approach of all $\Phi_a(q)$ to an isotropic fibre process with increasing standard deviation $a$ is clearly documented by the pdf's of the corresponding $r(\hat{\mathcal{R}}_6, \mathcal{U})$; they shift to smaller values (slowly at $a < 0.1$) and coincide at $a = 2$, see Fig. 14. The standard deviations of the distance distributions are comparable, as the local inhomogeneity of the examined tessellations is similar. The Prohorov distances are rather high, as the approximation of quasi-linear and quasi-planar anisotropies is difficult with a generally oriented probe. The $g$ factor gives a similar result (see Fig. 6 in [Hlawiczkova, Ponizil & Saxl, 2001]), namely nearly constant values in the interval $a \in [0.005, 0.1]$ and then a quick approach to the isotropic value $g = 1$. However, the values of $g$ for the process $\Phi_0$ as calculated from the corresponding $\hat{\mathcal{R}}_m$ are biased. The correct values are $\{7.5, 1.5, 0.15\}$ and their estimates by $\hat{\mathcal{R}}_{45}$ are $\{3.78, 0.97, 0.08\}$ at $q = 0.2, 1, 10$, respectively (see Fig. 12). The smaller the number of probe faces, the greater the bias. Note that the negative bias would describe a more pronounced linear anisotropy at $q = 10$, whereas in the remaining two cases the estimated planar anisotropy would be weaker. A further examination should elucidate whether a greater number of probe faces or a combination of several probe orientations would give better and more reliable results.

Conclusion 

The problem of the estimation of the rose of directions of fibre and surface processes from the rose of intersections has a long history, but it is not yet satisfactorily solved. Although the basic integral equation has the same form in any dimension, the properties of the theoretical tools for its solution differ substantially between the planar and the spatial case.

For analytical methods this difference is not so essential, but this only confirms that analytical methods are not deep enough to produce reliable solutions; typically we obtain negative values of densities in the solution. In the stochastic approach the statistical properties of the estimators are poor. This concerns even the planar situation, and the problems increase when dealing with the spatial case.

Convex geometry yields excellent tools for the investigation of the basic integral equation. The analogy between the rose of intersections and the support function of a zonoid is striking. The zonotopes converging to the zonoid corresponding to the desired rose of directions are thus already the desired estimators. Their construction in the plane is simple, and we can say that this approach leads to good estimators even for sharp or multimodal anisotropies.

Problems arise when applying the Steiner compact method of estimation of the rose of directions in space. A natural extension of the planar estimator is not available because of the properties of zonoids and zonotopes in higher dimensions. Still, two constructions were suggested, based either on linear programming techniques or on the EM algorithm for maximum likelihood estimation.

The Prohorov distance between the true and the estimated rose of directions is used as a measure of quality of the estimator. It enables comparison between various methods. Since the estimator is typically a discrete measure (based on observations from a finite set of test line orientations), this distance is not concentrated near zero. Simulation methods are used to verify new estimators, and the distribution of the Prohorov distance is plotted. The results presented here are a first systematic trial, and there is still a lot of work to be done in order to understand the properties of the estimators properly, especially in the spatial case.

The survey concentrates on a single complex problem; there are also related problems concerning anisotropy. The anisotropy of the spatial distribution of objects is mentioned in the Introduction. A more general stereological formula derived in Section 1 makes possible the use of local angular information around the stereological probe. For surfaces of particles, there is a variant of the rose of normal directions considering only outer normal vectors to the particles. This rose of directions is examined in several papers, e.g. [Rataj, 1996], [Weil, 1997], [Schneider, 2001]. Several authors also considered the anisotropy estimation for thick fibre systems modelled by Boolean models [Molchanov & Stoyan, 1994], [Karkkainen, Vedel Jensen & Jeulin, 2001]. These problems, however, lead to different concepts of stochastic geometry and were not intended to be discussed here.
Abstract In this article Poisson-type and compound Poisson approximations are 
discussed for a multiple scan statistic for binomial and Poisson data in one 
and two dimensions. Numerical results are presented to evaluate the performance 
of these approximations. Directions for future research and open problems are 
also stated. 



4.1 Introduction 

In this article we discuss Poisson-type and compound Poisson approxima- 
tions for multiple scan statistics for independent and identically distributed 
(iid) integer valued random variables from a binomial or a Poisson distribu- 
tion. Both one dimensional and two dimensional scan statistics are considered. 
The multiple scan statistics are discussed both for the unconditional case and 
for the case when the total number of the observed events is known 
(conditional case). One dimensional multiple scan statistics for iid Bernoulli 
random variables have been discussed in Chen and Glaz (1996) and Balakrishnan 
and Koutras (2002). 

One dimensional multiple scan statistics for continuous data are discussed 
in Glaz, Naus and Wallenstein (2001, Ch. 17). Approximations for multi- 
dimensional multiple scan statistics are discussed in Barbour and Mansson 
(2000) and Mansson (1999a, 1999b and 2000). 

This article is organized as follows. In Section 2, we present Poisson-type 
and compound Poisson approximations for the one dimensional multiple scan 
statistic, in both the conditional and the unconditional case. We also derive 
Bonferroni-type inequalities for the binomial model in the unconditional case. 
Since these inequalities have not performed well, we have not derived them for 
other cases. In Section 3, we present Poisson-type and compound Poisson 
approximations for the two dimensional multiple scan statistic, in both the 
conditional and the unconditional case. In Section 4, numerical results are 
discussed for the approximations derived in this article. Concluding remarks 
are presented in Section 5. 

4.2 The One Dimensional Case 

Let $X_1, \ldots, X_N$ be iid nonnegative integer valued random variables 
following a binomial or a Poisson distribution. First we consider the 
unconditional case, when the total number of events $\sum_{i=1}^{N} X_i$ is 
unknown. For integers $1 \le j \le N - m + 1$ and $k \ge 2$ let 

$$A_j = \{X_j + \cdots + X_{j+m-1} \ge k\} \qquad (2.1)$$

and 

$$I_j = \begin{cases} 1, & \text{if } A_j \text{ occurs} \\ 0, & \text{otherwise.} \end{cases} \qquad (2.2)$$

For integers $2 \le m \le N$ define a discrete scan statistic 

$$S_m = S_m(N) = \max\{X_i + \cdots + X_{i+m-1};\ 1 \le i \le N - m + 1\}. \qquad (2.3)$$

We say that a scan statistic of size $k$ has been observed if $S_m$ exceeds the 
value $k - 1$. Approximations for the distribution of $S_m$, applications and 
references are given in Glaz, Naus and Wallenstein (2001, Ch. 13). In this 
article we are interested in approximations for the distribution of a multiple 
scan statistic of size $k$ defined as: 

$$\xi = \sum_{j=1}^{N-m+1} I_j, \qquad (2.4)$$

where $I_j$ is given in Equation (2.2). For $1 \le n \le N - m + 1$, a Poisson 
approximation for $P(\xi \ge n)$ is given by 

$$P(\xi \ge n) \approx 1 - \sum_{i=0}^{n-1} \frac{e^{-\lambda} \lambda^i}{i!}, \qquad (2.5)$$

where 

$$\lambda = E(\xi) = (N - m + 1) P(A_1).$$
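
As an illustration of definitions (2.1)-(2.5), the following minimal Python 
sketch computes the moving window sums, the scan statistic $S_m$, the multiple 
scan statistic $\xi$ and the Poisson tail in (2.5) for a given mean. The 
function names are illustrative only and are not taken from the article. 

```python
import math

def window_sums(x, m):
    """Moving sums X_j + ... + X_{j+m-1} for j = 1, ..., N - m + 1."""
    total = sum(x[:m])
    sums = [total]
    for j in range(m, len(x)):
        total += x[j] - x[j - m]      # slide the window one position
        sums.append(total)
    return sums

def scan_statistic(x, m):
    """S_m: the largest moving sum, as in (2.3)."""
    return max(window_sums(x, m))

def multiple_scan_statistic(x, m, k):
    """xi: number of windows whose sum is at least k, as in (2.4)."""
    return sum(1 for s in window_sums(x, m) if s >= k)

def poisson_tail(lam, n):
    """1 - sum_{i=0}^{n-1} exp(-lam) lam^i / i!, the right side of (2.5)."""
    term = math.exp(-lam)
    cdf = 0.0
    for i in range(n):
        cdf += term
        term *= lam / (i + 1)
    return 1.0 - cdf

if __name__ == "__main__":
    x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
    print(scan_statistic(x, 3), multiple_scan_statistic(x, 3, 2))
```

The sliding update keeps the cost linear in $N$, which matters when the 
statistic is evaluated many times inside a simulation. 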

Since the events $A_j$, $1 \le j \le N - m + 1$, tend to clump, the Poisson 
approximation given in Equation (2.5) performed poorly for $P(\xi \ge 1)$ 
(Chen and Glaz 1999). Following the approach in Chen and Glaz (1997) the 
following Poisson-type approximation will be investigated. For 
$1 \le j \le N - m$, let 

$$I_j^* = \begin{cases} 1, & \text{if } A_j \cap A_{j+1}^c \cap \cdots \cap A^c_{\min(j+m-1,\,N-m+1)} \text{ occurs} \\ 0, & \text{otherwise} \end{cases} \qquad (2.6)$$

and 

$$I^*_{N-m+1} = \begin{cases} 1, & \text{if } A_{N-m+1} \text{ occurs} \\ 0, & \text{otherwise.} \end{cases}$$

By defining the indicators $I_j^*$ we are not allowing the events $A_j$ to 
clump. A Poisson-type approximation for $P(\xi \ge n)$ to be examined here is 
given by 

$$P(\xi \ge n) \approx 1 - \sum_{i=0}^{n-1} \frac{e^{-\lambda^*} (\lambda^*)^i}{i!}, \qquad (2.7)$$

where 

$$\lambda^* = E\Big(\sum_{j=1}^{N-m+1} I_j^*\Big) = 1 - q_{2m-2} + (N - 2m + 2)(q_{2m-2} - q_{2m-1}) \qquad (2.8)$$

and for $m \le j \le N$ 

$$q_j = P\Big(\bigcap_{i=1}^{j-m+1} A_i^c\Big). \qquad (2.9)$$

Numerical results for this Poisson-type approximation are given in Section 4, 
Tables 1 and 2. 
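
To make (2.8) and (2.9) concrete, the sketch below evaluates $q_j$ exactly by 
brute-force enumeration and then forms $\lambda^*$. It assumes iid 
Bernoulli($p$) observations (the binomial model with a single trial per cell) 
and is only practical for small $m$, since it enumerates $2^{2m-1}$ sequences; 
it is not the algorithm of Glaz and Naus (1991) used for the tables. 

```python
from itertools import product

def q_exact(j, m, k, p):
    """q_j = P(no window of length m among X_1..X_j has sum >= k),
    for iid Bernoulli(p) data, by enumerating all 0/1 sequences."""
    prob = 0.0
    for x in product((0, 1), repeat=j):
        if all(sum(x[i:i + m]) < k for i in range(j - m + 1)):
            ones = sum(x)
            prob += p ** ones * (1 - p) ** (j - ones)
    return prob

def lambda_star(N, m, k, p):
    """Declumped mean of (2.8): 1 - q_{2m-2} + (N-2m+2)(q_{2m-2} - q_{2m-1})."""
    q_a = q_exact(2 * m - 2, m, k, p)
    q_b = q_exact(2 * m - 1, m, k, p)
    return 1 - q_a + (N - 2 * m + 2) * (q_a - q_b)

if __name__ == "__main__":
    print(lambda_star(10, 2, 2, 0.5))
```

For $m = 2$, $k = 2$, $p = 1/2$ one has $q_2 = 3/4$ and $q_3 = 5/8$, so with 
$N = 10$ the formula gives $\lambda^* = 1/4 + 8 \cdot 1/8 = 1.25$. 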

A compound Poisson approximation for $\xi$ based on the approach in Roos 
(1993, Lemma 3.3.4) is given by: 

$$P(\xi \ge n) \approx 1 - \sum_{j=0}^{n-1} \Bigg( \sum_{\beta_1 + 2\beta_2 + \cdots + (2m-1)\beta_{2m-1} = j} \ \prod_{i=1}^{2m-1} \frac{\lambda_i^{\beta_i}}{\beta_i!} \Bigg) \exp\Big(-\sum_{i=1}^{2m-1} \lambda_i\Big), \qquad (2.10)$$

where $\beta_i$ are non-negative integers, 

$$\lambda_i = (N - m + 1)\,\pi (1 - p)^2 p^{i-1}, \qquad i = 1, \ldots, m - 1,$$

$$\lambda_i = \frac{(N - m + 1)\,\pi}{i} \left[ 2(1 - p) p^{i-1} + (2m - i - 2)(1 - p)^2 p^{i-1} \right], \qquad m \le i \le 2m - 2,$$

$$\lambda_{2m-1} = \frac{(N - m + 1)(1 - q_m)\, p^{2m-2}}{2m - 1}$$

and 

$$\pi = P(I_1 = 1) = 1 - q_m,$$

$$p = P(I_1 = 1, I_2 = 1)/P(I_1 = 1) = (1 - 2q_m + q_{m+1})/(1 - q_m).$$

Numerical results for this compound Poisson approximation are presented in 
Section 4, Tables 1 and 2. 
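
The double sum in the compound Poisson tail (2.10) is the distribution of 
$\sum_i i Z_i$ with independent $Z_i \sim$ Poisson($\lambda_i$). As a hedged 
sketch, the tail can be evaluated without enumerating the $\beta_i$ by the 
standard recursion $j\,P(j) = \sum_i i \lambda_i P(j - i)$ for compound Poisson 
probabilities; this recursion is a substitute for the explicit sum and is not 
described in the article. The list `lams` of clump-size rates is supplied by 
the caller. 

```python
import math

def compound_poisson_tail(lams, n):
    """P(xi >= n) for xi = sum_i i * Z_i, Z_i ~ Poisson(lams[i-1]) independent.

    Uses P(0) = exp(-sum(lams)) and j * P(j) = sum_i i * lams[i-1] * P(j - i).
    """
    probs = [math.exp(-sum(lams))]                 # P(xi = 0)
    for j in range(1, n):
        acc = sum(i * lams[i - 1] * probs[j - i]
                  for i in range(1, min(j, len(lams)) + 1))
        probs.append(acc / j)                      # P(xi = j)
    return 1.0 - sum(probs)

if __name__ == "__main__":
    # With a single clump size the result reduces to an ordinary Poisson tail.
    print(compound_poisson_tail([2.0], 3))
```

When only $\lambda_1$ is non-zero the result coincides with the Poisson-type 
tail, which gives a quick consistency check. 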

We now discuss Bonferroni-type inequalities for $P(\xi \ge n)$. Consider the 
events $A_j$, $1 \le j \le N - m + 1$, defined in Equation (2.1). Let 
$\nu_{N-m+1}$ be the number of $A_j$'s that have occurred. Then 

$$P(\nu_{N-m+1} \ge n) = P(\xi \ge n).$$

For $1 \le j \le 3$, let 

$$S_j = \sum_{1 \le i_1 < \cdots < i_j \le N-m+1} P(A_{i_1} \cap \cdots \cap A_{i_j}).$$

It follows from Galambos and Simonelli (1996, pages 118-119) that: 

6s 3 - 2(2z + iV - m - l)a 2 + »(* + 2iV - 2m + l)ai 
(? - nj ~ (N-m-n + 2)(i + l-n)(i + 2-n) 

(n - l)[(2i - n)(N - m + 1) + 2(N - m + 1)] 
(N -m-n + 2)(i + 1 - n)(i + 2 - n) 
(n - l)[n 2 - 2m - 3n + z 2 + 3i + 2] 
(AT - m - n + 2)(t + 1 - n)(i + 2 - n) ' 

forl<n<iV — m — l,n<i<iV — m — 1 and 

t(2n + 1 - l)«i - 2(n + 2» - 2)s 2 + 6s 3 



P(£ > n) < 



i(i + l)n 



forl<n<JV — m — 1, m+l<i<n — 1. Numerical results for these 
Bonferroni-type inequalities are presented in Section 4, Table 1. 

We now discuss approximations for a multiple scan statistic when the total 
number of events $\sum_{i=1}^{N} X_i = a$ is known. For $1 \le j \le N - m + 1$ 
and $k \ge 2$, let 

$$A_j^* = \Big\{X_j + \cdots + X_{j+m-1} \ge k \ \Big|\ \sum_{i=1}^{N} X_i = a\Big\}$$

and 

$$I_j(a) = \begin{cases} 1, & \text{if } A_j^* \text{ occurs} \\ 0, & \text{otherwise.} \end{cases}$$

The conditional multiple scan statistic is defined as: 

$$\xi(a) = \sum_{j=1}^{N-m+1} I_j(a).$$

For $m \le j \le N$, set 

$$q_j(a) = P\Big(S_m(j) \le k - 1 \ \Big|\ \sum_{i=1}^{N} X_i = a\Big) = P\Big(\bigcap_{i=1}^{j-m+1} A_i^{*c}\Big).$$

A Poisson-type approximation for $P(\xi(a) \ge n)$ can be obtained from 
Equations (2.7) and (2.8) by replacing the terms $q_{2m-2}$ and $q_{2m-1}$ with 
$q_{2m-2}(a)$ and $q_{2m-1}(a)$, respectively. Let 

$$\lambda_i(a) = (N - m + 1)\,\pi(a)(1 - p(a))^2 p(a)^{i-1},$$

for $1 \le i \le m - 1$, 

$$\lambda_i(a) = \frac{(N - m + 1)\,\pi(a)}{i} \left[ 2(1 - p(a)) p(a)^{i-1} + (2m - i - 2)(1 - p(a))^2 p(a)^{i-1} \right]$$

for $m \le i \le 2m - 2$, 

$$\lambda_{2m-1}(a) = \frac{(N - m + 1)(1 - q_m(a))\, p(a)^{2m-2}}{2m - 1}$$

and 

$$\pi(a) = P(I_1(a) = 1) = 1 - q_m(a),$$

$$p(a) = \frac{P(I_1(a) = 1, I_2(a) = 1)}{P(I_1(a) = 1)} = \frac{1 - 2q_m(a) + q_{m+1}(a)}{1 - q_m(a)}.$$

A compound Poisson approximation for $P(\xi(a) \ge n)$ can be obtained from 
Equation (2.10) by replacing the terms $\lambda_i$, $\pi$ and $p$ with 
$\lambda_i(a)$, $\pi(a)$ and $p(a)$, respectively. Numerical results for these 
approximations are presented in Section 4, Tables 3 and 4. 
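
For the Poisson model, conditioning on the total $\sum X_i = a$ has a simple 
form: iid Poisson counts given their sum are multinomial with equal cell 
probabilities, so the $a$ events can be placed independently and uniformly 
over the $N$ cells. The sketch below uses this fact to estimate $q_j(a)$ by 
simulation; the function names, the trial count and the way the random 
generator is passed in are assumptions of this example, not the article's 
procedure. 

```python
import random

def sample_conditional_poisson(N, a, rng):
    """Draw (X_1..X_N) for iid Poisson data conditioned on sum = a:
    each of the a events falls in a uniformly chosen cell."""
    x = [0] * N
    for _ in range(a):
        x[rng.randrange(N)] += 1
    return x

def estimate_q_conditional(j, m, k, N, a, trials, rng):
    """Monte Carlo estimate of q_j(a) = P(S_m(j) <= k - 1 | total = a)."""
    hits = 0
    for _ in range(trials):
        x = sample_conditional_poisson(N, a, rng)
        # S_m(j) uses only the first j observations
        if all(sum(x[i:i + m]) < k for i in range(j - m + 1)):
            hits += 1
    return hits / trials

if __name__ == "__main__":
    rng = random.Random(2024)
    print(estimate_q_conditional(19, 10, 3, 100, 10, 2000, rng))
```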

4.3 The Two Dimensional Case 

Let $X_{i,j}$, $i = 1, \ldots, N_1$ and $j = 1, \ldots, N_2$, be iid nonnegative 
integer valued random variables with a binomial or a Poisson distribution. Let 

$$Y_{i_1,i_2} = \sum_{j=i_2}^{i_2+m_2-1} \ \sum_{i=i_1}^{i_1+m_1-1} X_{i,j}, \qquad (3.1)$$

where $1 \le i_1 \le N_1 - m_1 + 1$ and $1 \le i_2 \le N_2 - m_2 + 1$. The 
two-dimensional scan statistic is defined as: 

$$S_{m_1,m_2} = \max\{Y_{i_1,i_2};\ 1 \le i_1 \le N_1 - m_1 + 1,\ 1 \le i_2 \le N_2 - m_2 + 1\}. \qquad (3.2)$$

Approximations for the distribution of $S_{m_1,m_2}$, applications and 
references are given in Glaz, Naus and Wallenstein (2001, Ch. 16). For 
simplicity we assume here that $N_1 = N_2 = N$ and $m_1 = m_2 = m$. For 
$1 \le i_1, i_2 \le N - m + 1$ define the events 

$$A_{i_1,i_2} = \Big\{ \sum_{i=i_1}^{i_1+m-1} \ \sum_{j=i_2}^{i_2+m-1} X_{i,j} \ge k \Big\}. \qquad (3.3)$$

Let $\Gamma = \{(i_1, i_2);\ 1 \le i_1 \le N - m + 1,\ 1 \le i_2 \le N - m + 1\}$ 
denote the index set of a collection of the integer valued random variables 
$\{I_\alpha;\ \alpha \in \Gamma\}$, where 

$$I_\alpha = \begin{cases} 1, & \text{if } Y_\alpha \ge k \\ 0, & \text{otherwise.} \end{cases} \qquad (3.4)$$

We are interested in approximating the distribution of a two dimensional 
multiple scan statistic 

$$\xi_{m,m} = \sum_{\alpha \in \Gamma} I_\alpha.$$

For $1 \le j \le m + 1$, let 

$$q_{m+j-1} = P\Big(\bigcap_{i=1}^{j} A^c_{1,i}\Big). \qquad (3.5)$$
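
The two dimensional window sums, the scan statistic and the count of windows 
reaching level $k$ can be sketched in a few lines of Python. This is a minimal 
illustration for a square $N \times N$ array and a square $m \times m$ window; 
it uses a two-dimensional prefix-sum table so that every window sum costs 
constant time. Function names are illustrative. 

```python
def window_sums_2d(x, m):
    """All m x m window sums of the square array x (list of lists)."""
    N = len(x)
    # prefix[i][j] holds the sum of x over rows < i and columns < j
    prefix = [[0] * (N + 1) for _ in range(N + 1)]
    for i in range(N):
        for j in range(N):
            prefix[i + 1][j + 1] = (x[i][j] + prefix[i][j + 1]
                                    + prefix[i + 1][j] - prefix[i][j])
    size = N - m + 1
    return [[prefix[i + m][j + m] - prefix[i][j + m]
             - prefix[i + m][j] + prefix[i][j]
             for j in range(size)] for i in range(size)]

def scan_statistic_2d(x, m):
    """Largest m x m window sum."""
    return max(max(row) for row in window_sums_2d(x, m))

def multiple_scan_statistic_2d(x, m, k):
    """Number of m x m windows whose sum is at least k."""
    return sum(1 for row in window_sums_2d(x, m) for y in row if y >= k)

if __name__ == "__main__":
    x = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    print(window_sums_2d(x, 2), multiple_scan_statistic_2d(x, 2, 2))
```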

Under quite general conditions the distribution of 
$\sum_{\alpha \in \Gamma} I_\alpha$ converges to the Poisson distribution with 
mean $\lambda_1$, where 

$$\lambda_1 = E\Big(\sum_{\alpha \in \Gamma} I_\alpha\Big) = (N - m + 1)^2 (1 - q_m) \qquad (3.6)$$

(Darling and Waterman 1986). This Poisson approximation for the special case 
of $k = m^2$ has been discussed in Barbour, Chryssaphinou and Roos (1995), 
Koutras, Papadopoulos and Papastavridis (1993), and Roos (1994). The Poisson 
approximation is not expected to perform well when $k < m^2$, since the 
events $\{(Y_\alpha \ge k);\ \alpha \in \Gamma\}$ tend to clump. Employing a 
local declumping approach, Chen and Glaz (1996) derived a more accurate 
Poisson-type approximation: 

$$P(\xi_{m,m} \ge 1) = P(S_{m,m} \ge k) \approx 1 - \exp(-\lambda_1^*), \qquad (3.7)$$

where 

$$\lambda_1^* = 1 - q_{2m-2} + (N - 2m + 2)(N - m + 1)(q_{2m-2} - q_{2m-1}). \qquad (3.8)$$

In this article we investigate the performance of the following Poisson-type 
approximation: 

$$P(\xi_{m,m} \ge n) \approx 1 - \sum_{i=0}^{n-1} \frac{e^{-\lambda_1^*} (\lambda_1^*)^i}{i!}. \qquad (3.9)$$


i=0 
A compound Poisson approximation for $P(\xi_{m,m} \ge n)$ presented below is 
based on Roos (1993 and 1994): 

$$P(\xi_{m,m} \ge n) \approx 1 - \sum_{j=0}^{n-1} \Bigg( \sum_{\beta_1 + 2\beta_2 + 3\beta_3 + 4\beta_4 + 5\beta_5 = j} \ \prod_{i=1}^{5} \frac{\lambda_{1,i}^{\beta_i}}{\beta_i!} \Bigg) \exp\Big(-\sum_{i=1}^{5} \lambda_{1,i}\Big), \qquad (3.10)$$

where for $1 \le i \le 5$: 

$$\lambda_{1,i} = \frac{1}{i}(1 - q_m) \left\{ 4\pi_{1,i} + 4(N - m - 1)\pi_{2,i} + (N - m + 1)^2 \pi_{3,i} \right\},$$

$$\pi_{1,i} = P\{I_{1,2} + I_{2,1} = i - 1 \mid I_{1,1} = 1\},$$

$$\pi_{2,i} = P\{I_{1,1} + I_{2,2} + I_{3,1} = i - 1 \mid I_{2,1} = 1\},$$

and 

$$\pi_{3,i} = P\{I_{1,2} + I_{2,1} + I_{2,3} + I_{3,2} = i - 1 \mid I_{2,2} = 1\}.$$

In Section 4, Tables 5 and 6, we present numerical results for these 
Poisson-type and compound Poisson approximations. 

We now present approximations for the multiple scan statistic given that 
the total number of observed events 
$\sum_{j=1}^{N} \sum_{i=1}^{N} X_{i,j} = a$ is known. For 
$1 \le i_1, i_2 \le N - m + 1$ define the events 

$$A^*_{i_1,i_2} = \Big\{ \sum_{i=i_1}^{i_1+m-1} \ \sum_{j=i_2}^{i_2+m-1} X_{i,j} \ge k \ \Big|\ \sum_{j=1}^{N} \sum_{i=1}^{N} X_{i,j} = a \Big\}.$$

Let 

$$I_\alpha(a) = \begin{cases} 1, & \text{if } A^*_\alpha \text{ occurs} \\ 0, & \text{otherwise,} \end{cases}$$

where $\alpha \in \Gamma$. The conditional multiple scan statistic considered 
here is given by 

$$\xi_{m,m}(a) = \sum_{\alpha \in \Gamma} I_\alpha(a).$$

For $1 \le j \le m + 1$, let 

$$q_{m+j-1}(a) = P\Big(\bigcap_{i=1}^{j} A^{*c}_{1,i}\Big).$$

A Poisson-type approximation for $P(\xi_{m,m}(a) \ge n)$ is obtained from 
Equations (3.9) and (3.8) by replacing the terms $q_{2m-1}$ and $q_{2m-2}$ with 
$q_{2m-1}(a)$ and $q_{2m-2}(a)$, respectively. 

For $1 \le i \le 5$ let 

$$\lambda_{1,i}(a) = \frac{1}{i}(1 - q_m(a)) \left\{ 4\pi_{1,i}(a) + 4(N - m - 1)\pi_{2,i}(a) + (N - m + 1)^2 \pi_{3,i}(a) \right\},$$

$$\pi_{1,i}(a) = P\{I_{1,2}(a) + I_{2,1}(a) = i - 1 \mid I_{1,1}(a) = 1\},$$

$$\pi_{2,i}(a) = P\{I_{1,1}(a) + I_{2,2}(a) + I_{3,1}(a) = i - 1 \mid I_{2,1}(a) = 1\},$$

and 

$$\pi_{3,i}(a) = P\{I_{1,2}(a) + I_{2,1}(a) + I_{2,3}(a) + I_{3,2}(a) = i - 1 \mid I_{2,2}(a) = 1\}.$$

A compound Poisson approximation for $P(\xi_{m,m}(a) \ge n)$ is obtained from 
Equation (3.10) by replacing the terms $\lambda_{1,i}$, $\pi_{1,i}$, 
$\pi_{2,i}$ and $\pi_{3,i}$ with $\lambda_{1,i}(a)$, $\pi_{1,i}(a)$, 
$\pi_{2,i}(a)$ and $\pi_{3,i}(a)$, respectively. Numerical results for these 
approximations are given in Section 4, Tables 7 and 8. 

4.4 Numerical Results 

In this section we present numerical results for approximations and 
inequalities for the multiple scan statistics discussed in this article. In 
Tables 1-8 the improved Poisson-type approximations are denoted by ImPoi, while 
the compound Poisson approximations are denoted by ComPoi. The Bonferroni-type 
inequalities considered in this article have not performed well. Numerical 
results for these inequalities are presented in Table 1, where the lower and 
upper bounds are denoted by LBound and UBound, respectively. In Tables 1-8, 
the column headed $P(\cdot \ge n)$ is an estimate of the tail probability of 
the appropriate multiple scan statistic based on a simulation with 10,000 
trials. In Tables 1-2, Poisson-type and compound Poisson approximations, as 
well as the Bonferroni-type inequalities, are evaluated using an algorithm 
discussed in Glaz and Naus (1991). The quantities $q_j$, $m \le j \le 2m - 1$, 
needed for evaluating Poisson-type and compound Poisson-type approximations 
for the multiple scan statistic $\xi(a)$, Tables 3-4, are obtained from a 
simulation with 100,000 trials of sequences of $2m - 1$ iid binomial or Poisson 
random variables. The quantities $q_j$, $m \le j \le 2m - 1$, needed for 
evaluating Poisson-type and compound Poisson-type approximations for the 
multiple scan statistics $\xi_{m,m}$ and $\xi_{m,m}(a)$, Tables 5-8, are 
obtained from a simulation with 100,000 trials of sequences of 
$(2m - 1) \times (2m - 1)$ iid binomial or Poisson random variables. 
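
A simulation of the kind used for the reference column of the tables can be 
sketched as follows. The example assumes iid Bernoulli($p$) observations in 
one dimension; the function name, its arguments and the seeding are choices 
made for this illustration, and the binomial model with more than one trial 
per cell would need a different sampler. 

```python
import random

def simulate_tail(N, m, k, p, n_max, trials, rng):
    """Estimate P(xi >= n) for n = 1..n_max from `trials` simulated sequences
    of N iid Bernoulli(p) variables, window length m and level k."""
    counts = [0] * (n_max + 1)
    for _ in range(trials):
        x = [1 if rng.random() < p else 0 for _ in range(N)]
        s = sum(x[:m])
        xi = 1 if s >= k else 0
        for j in range(m, N):
            s += x[j] - x[j - m]          # next moving sum
            if s >= k:
                xi += 1
        for n in range(1, min(xi, n_max) + 1):
            counts[n] += 1                # this run contributes to xi >= n
    return [counts[n] / trials for n in range(1, n_max + 1)]

if __name__ == "__main__":
    rng = random.Random(12345)
    print(simulate_tail(100, 10, 5, 0.05 * 5, 5, 10000, rng))
```

With 10,000 trials the standard error of an estimated probability near 0.5 is 
about 0.005, which bounds how finely the simulated column can discriminate 
between competing approximations. 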
Table 1. Approximations and Bounds for $\xi$. Binomial Model. 

N m p L k n P(ξ ≥ n) ImPoi ComPoi LBound UBound 
100 10 .05 5 : 
.10 5 10 
11 
12 

  n   P(ξ ≥ n)   ImPoi    ComPoi   LBound   UBound 
  1   0.8840     0.7978   0.8973   0.6135   1.0000 
  2   0.8371     0.4746   0.8420   0.1768   1.0000 
  3   0.7923     0.2163   0.7851   0.1593   1.0000 
  4   0.7446     0.0786   0.7277   0.1414   1.0000 
  5   0.6970     0.0236   0.6710   0.1232   1.0000 

  1   0.5972     0.5521   0.6450   0.3818   1.0000 
  2   0.5123     0.1923   0.5350   0.0189   1.0000 
  3   0.4383     0.0479   0.4409   0.0137   1.0000 
  4   0.3659     0.0092   0.3614   0.0093   0.7832 
  5   0.3121     0.0014   0.2948   0.0058   0.6472 

  1   0.2847     0.2738   0.3201   0.1808   1.0000 
  2   0.2143     0.0415   0.2259   0.0010   0.5128 
  3   0.1605     0.0043   0.1591   0.0005   0.3495 
  4   0.1192     0.0003   0.1118   0.0001   0.2679 
  5   0.0891     0.0000   0.0784   0.0000   0.2189 

  1   0.4971     0.4650   0.5540   0.3079   1.0000 
  2   0.3932     0.1304   0.4240   0.0026   1.0000 
  3   0.3073     0.0257   0.3221   0.0016   0.7125 
  4   0.2415     0.0039   0.2430   0.0010   0.5399 
  5   0.1925     0.0005   0.1823   0.0008   0.4364 

  1   0.2635     0.2489   0.2947   0.1612   0.8145 
  2   0.1821     0.0339   0.1938   0.0004   0.4115 
  3   0.1331     0.0032   0.1270   0.0003   0.2772 
  4   0.0967     0.0002   0.0830   0.0002   0.2100 
  5   0.0680     0.0000   0.0541   0.0001   0.1697 

  1   0.1110     0.1082   0.1253   0.0702   0.2836 
  2   0.0695     0.0061   0.0718   0.0001   0.1433 
  3   0.0452     0.0002   0.0411   0.0001   0.0965 
  4   0.0295     0.0000   0.0235   0.0000   0.0731 
  5   0.0187     0.0000   0.0135   0.0000   0.0470 

From Tables 1-8 it is evident that compound Poisson approximations are 
more accurate than the Poisson-type approximations investigated in this article. 
The compound Poisson approximations have performed well, especially in the 
one dimensional case. In the two dimensional case these approximations were 
not as accurate as one would like them to be. There is a need for further 
research to derive accurate approximations for multiple scan statistics. 
Table 2. Approximations for $\xi$. Poisson Model. 

N m k n P(ξ ≥ n) ImPoi ComPoi 
100 10 .10 3 

  n   P(ξ ≥ n)   ImPoi    ComPoi 
  1   0.7665     0.6928   0.7728 
  2   0.7170     0.3303   0.7074 
  3   0.6707     0.1163   0.6455 
  4   0.6273     0.0321   0.5874 
  5   0.5776     0.0072   0.5333 

  1   0.3520     0.3309   0.3718 
  2   0.2954     0.0621   0.2941 
  3   0.2469     0.0080   0.2325 
  4   0.2024     0.0008   0.1836 
  5   0.1633     0.0001   0.1449 

  1   0.0960     0.0953   0.1048 
  2   0.0720     0.0047   0.0720 
  3   0.0538     0.0002   0.0494 
  4   0.0399     0.0000   0.0340 
  5   0.0282     0.0000   0.0234 

  1   0.0208     0.0195   0.0209 
  2   0.0137     0.0002   0.0128 
  3   0.0099     0.0000   0.0079 
  4   0.0061     0.0000   0.0048 
  5   0.0044     0.0000   0.0030 

4.5 Concluding Remarks 

From the numerical results presented in this article it is evident that further 
research has to be conducted in the area of multiple scan statistics, especially 
in the multi-dimensional case. Approximations for the distribution of multiple 
scan statistics for continuous data also present many challenging problems. 
Modeling and statistical inference of spatial data is one of the most active 
research areas in probability and statistics. It has many applications in 
science and technology including: anthropology, archaeology, astronomy, 
ecology, environmental science, epidemiology, geology, image analysis, 
meteorology, reconnaissance and urban and regional planning. The use of spatial 
scan statistics in two or higher dimensional regions has been discussed, among 
others, in Wallenstein, Gould and Kleinman (1989), Priebe, Olson, Healy (1997), 
Kulldorff (1999), Chan and Lai (2000), Siegmund and Yakir (2000), Glaz, Naus 
and Wallenstein (2001), Priebe and Chen (2001), Priebe, Naiman and Cope 
(2001). Multiple scan statistics are of great importance in this area of 
research as well. More work is needed to derive accurate approximations 
for the distribution of multiple scan statistics used in statistical inference 
for spatial data. 



Table 3. Approximations for $\xi(a)$. Binomial Model. 

N m L a 
100 5 5 25 
20 5 25 
10 5 50 10 
11 

  n   P(ξ(a) ≥ n)   ImPoi    ComPoi 
  1   0.6654        0.6887   0.6982 
  2   0.5207        0.3254   0.5165 
  3   0.3742        0.1134   0.3668 
  4   0.2477        0.0310   0.2512 
  5   0.0901        0.0069   0.1352 

  1   0.2702        0.2605   0.2635 
  2   0.1493        0.0373   0.1352 
  3   0.0730        0.0037   0.0689 
  4   0.0302        0.0003   0.0351 
  5   0.0100        0.0000   0.0179 

  1   0.6268        0.5167   0.5903 
  2   0.5080        0.1653   0.4672 
  3   0.4042        0.0375   0.3672 
  4   0.3215        0.0066   0.2868 
  5   0.2448        0.0009   0.2228 

  1   0.2351        0.2121   0.2349 
  2   0.1590        0.0243   0.1588 
  3   0.1084        0.0019   0.1072 
  4   0.0721        0.0001   0.0723 
  5   0.0449        0.0000   0.0487 

  1   0.4845        0.3988   0.4636 
  2   0.3544        0.0929   0.3391 
  3   0.2565        0.0151   0.2466 
  4   0.1870        0.0019   0.1784 
  5   0.1347        0.0002   0.1285 

  1   0.2068        0.1841   0.2064 
  2   0.1314        0.0181   0.1279 
  3   0.0847        0.0012   0.0790 
  4   0.0555        0.0001   0.0488 
  5   0.0326        0.0000   0.0301 
Table 4. Approximations for $\xi(a)$. Poisson Model. 

N 
100 
10 
10 10 

  n   P(ξ(a) ≥ n)   ImPoi    ComPoi 
  1   0.6330        0.5155   0.6179 
  2   0.5319        0.1644   0.5211 
  3   0.4234        0.0372   0.3855 
  4   0.3044        0.0065   0.2772 
  5   0.1780        0.0009   0.2231 

  1   0.9116        0.6779   0.7204 
  2   0.8755        0.3130   0.6664 
  3   0.8281        0.1063   0.6157 
  4   0.7795        0.0282   0.5683 
  5   0.7254        0.0061   0.5241 

  1   0.2023        0.1697   0.1728 
  2   0.1683        0.0153   0.1403 
  3   0.1367        0.0009   0.1141 
  4   0.1030        0.0000   0.0928 
  5   0.0784        0.0000   0.0756 

  1   0.3123        0.2270   0.2805 
  2   0.2490        0.0280   0.2207 
  3   0.1947        0.0023   0.1736 
  4   0.1469        0.0001   0.1366 
  5   0.1051        0.0000   0.1075 

  1   0.0472        0.0441   0.0498 
  2   0.0350        0.0010   0.0346 
  3   0.0232        0.0000   0.0240 
  4   0.0152        0.0000   0.0167 
  5   0.0098        0.0000   0.0116 
Table 5. Approximations for $\xi_{m,m}$. Binomial Model. 

N m L p k n P(ξ_{m,m} ≥ n) ImPoi ComPoi 
25 5 5 .05 15 
16 
10 5 .05 39 
100 10 5 .05 46 
47 

  n   P(ξ_{m,m} ≥ n)   ImPoi    ComPoi 
  1   0.2461           0.2954   0.3008 
  2   0.1447           0.0487   0.1646 
  3   0.0890           0.0055   0.0552 
  4   0.0591           0.0005   0.0233 
  5   0.0375           0.0000   0.0094 

  1   0.1027           0.1232   0.1319 
  2   0.0527           0.0079   0.0710 
  3   0.0301           0.0003   0.0235 
  4   0.0196           0.0000   0.0083 
  5   0.0121           0.0000   0.0029 

  1   0.1846           0.1976   0.2645 
  2   0.1373           0.0210   0.2033 
  3   0.1098           0.0015   0.1470 
  4   0.0917           0.0001   0.0897 
  5   0.0767           0.0000   0.0595 

  1   0.1585           0.2449   0.2334 
  2   0.0998           0.0328   0.1012 
  3   0.0698           0.0030   0.0455 
  4   0.0507           0.0002   0.0168 
  5   0.0360           0.0000   0.0075 

  1   0.0876           0.1351   0.1022 
  2   0.0526           0.0096   0.0715 
  3   0.0325           0.0005   0.0559 
  4   0.0231           0.0000   0.0261 
  5   0.0163           0.0000   0.0132 
Table 6. Approximations for $\xi_{m,m}$. Poisson Model. 

N m θ k n P(ξ_{m,m} ≥ n) ImPoi ComPoi 
25 5 .25 14 
15 
16 
100 5 .50 29 
30 

  n   P(ξ_{m,m} ≥ n)   ImPoi    ComPoi 
  1   0.5274           0.6526   0.6546 
  2   0.3854           0.2853   0.5013 
  3   0.2910           0.0911   0.3440 
  4   0.2258           0.0227   0.2247 
  5   0.1744           0.0046   0.1672 

  1   0.2964           0.3698   0.3754 
  2   0.1825           0.0788   0.2511 
  3   0.1173           0.0116   0.0930 
  4   0.0809           0.0013   0.0451 
  5   0.0577           0.0001   0.0184 

  1   0.1448           0.1708   0.1418 
  2   0.0805           0.0155   0.0654 
  3   0.0445           0.0010   0.0273 
  4   0.0277           0.0000   0.0103 
  5   0.0181           0.0000   0.0019 

  1   0.2307           0.2715   0.3012 
  2   0.1064           0.0407   0.1174 
  3   0.0546           0.0042   0.0285 
  4   0.0297           0.0003   0.0187 
  5   0.0157           0.0000   0.0031 

  1   0.1061           0.1246   0.0968 
  2   0.0402           0.0081   0.0233 
  3   0.0154           0.0004   0.0021 
  4   0.0071           0.0000   0.0015 
Table 7. Approximations for $\xi_{m,m}(a)$. Binomial Model. 

   N   m L     a   k   n   P(ξ_{m,m}(a) ≥ n)   ImPoi    ComPoi 
  25   5 5    25   4   1   0.8659              0.9263   0.9438 
                       2   0.7946              0.7341   0.9223 
                       3   0.7108              0.4835   0.8386 
                       4   0.6347              0.2656   0.7707 
                       5   0.5542              0.1236   0.6763 
       5           5   1   0.3182              0.3525   0.2906 
                       2   0.2323              0.0711   0.2099 
                       3   0.1522              0.0099   0.1462 
                       4   0.1137              0.0011   0.0690 
                       5   0.0813              0.0001   0.0298 
                   6   1   0.0598              0.1337   0.0468 
                       2   0.0337              0.0016   0.0305 
                       3   0.0179              0.0000   0.0144 
                       4   0.0113              0.0000   0.0055 
                       5   0.0075              0.0000   0.0001 
  25   5 5    50   7   1   0.4098              0.4441   0.4958 
                       2   0.2817              0.1177   0.3788 
                       3   0.1899              0.0219   0.2155 
                       4   0.1363              0.0031   0.1255 
                       5   0.0949              0.0004   0.0666 
                   8   1   0.1172              0.1284   0.1234 
                       2   0.0645              0.0086   0.0545 
                       3   0.0347              0.0004   0.0195 
                       4   0.0200              0.0000   0.0076 
                       5   0.0124              0.0000   0.0025 
 100   5 5   100   5   1   0.8695              0.9442   0.8716 
                       2   0.7895              0.7830   0.8078 
                       3   0.6944              0.5506   0.6992 
                       4   0.6117              0.3270   0.5867 
                       5   0.5263              0.1658   0.4759 
                   6   1   0.3052              0.3595   0.3210 
                       2   0.2041              0.0741   0.2286 
                       3   0.1290              0.0106   0.1037 
                       4   0.0894              0.0012   0.0541 
                       5   0.0585              0.0001   0.0247 
Table 8. Approximations for $\xi_{m,m}(a)$. Poisson Model. 

N m a k n P(ξ_{m,m}(a) ≥ n) ImPoi ComPoi 
25 5 300 22 
23 
24 
100 10 1000 24 
25 

  n   P(ξ_{m,m}(a) ≥ n)   ImPoi    ComPoi 
  1   0.5937              0.6718   0.6586 
  2   0.4241              0.3061   0.4853 
  3   0.3052              0.1024   0.3168 
  4   0.2231              0.0268   0.2009 
  5   0.1635              0.0057   0.1186 

  1   0.3810              0.4291   0.3916 
  2   0.2148              0.1091   0.2336 
  3   0.0000              0.0194   0.1076 
  4   0.0000              0.0026   0.0519 
  5   0.0000              0.0003   0.0221 

  1   0.2314              0.2427   0.2785 
  2   0.1122              0.0322   0.1667 
  3   0.0000              0.0029   0.0474 
  4   0.0000              0.0002   0.0206 
  5   0.0000              0.0000   0.0065 

  1   0.2197              0.3119   0.2868 
  2   0.1021              0.0547   0.1660 
  3   0.0000              0.0066   0.0364 
  4   0.0000              0.0006   0.0147 
  5   0.0000              0.0000   0.0028 

  1   0.1467              0.0722   0.1776 
  2   0.0628              0.0027   0.1745 
  3   0.0000              0.0001   0.1715 
  4   0.0000              0.0000   0.0826 
  5   0.0000              0.0000   0.0164 
Abstract Krawtchouk matrices have as entries values of the Krawtchouk 
polynomials for nonnegative integer arguments. We show how they arise as 
condensed Sylvester-Hadamard matrices via a binary shuffling function. The 
underlying symmetric tensor algebra is then presented. 

To advertise the breadth and depth of the field of Krawtchouk polynomials / 
matrices through connections with various parts of mathematics, some topics 
that are being developed into a Krawtchouk Encyclopedia are listed in the 
concluding section. Interested folks are encouraged to visit the website 

http://chanoir.math.siu.edu/Kravchuk/index.html 

which is currently in a state of development. 



5.1 What are Krawtchouk matrices 

Of Sylvester-Hadamard matrices and Krawtchouk matrices, the latter are 
less familiar, hence we start with them. 

DEFINITION 1 The $N$th-order Krawtchouk matrix $K^{(N)}$ is an 
$(N+1) \times (N+1)$ matrix, the entries of which are determined by the 
expansion: 

$$(1 + v)^{N-j} (1 - v)^{j} = \sum_{i=0}^{N} v^i K^{(N)}_{ij} \qquad (1.1)$$

Thus, the polynomial $G(v) = (1 + v)^{N-j} (1 - v)^{j}$ is the generating 
function for the row entries of the $j$th column of $K^{(N)}$. Expanding gives 
the explicit values of the matrix entries: 

$$K^{(N)}_{ij} = \sum_{k} (-1)^k \binom{j}{k} \binom{N-j}{i-k}$$

where matrix indices run from 0 to $N$. 

Here are the Krawtchouk matrices of order zero, one, and two: 

$$K^{(0)} = \begin{bmatrix} 1 \end{bmatrix}, \qquad K^{(1)} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \qquad K^{(2)} = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 0 & -2 \\ 1 & -1 & 1 \end{bmatrix}$$

The reader is invited to see more examples in Table 1 of the Appendix. 

The columns of Krawtchouk matrices may be considered generalized binomial 
coefficients. The rows define Krawtchouk polynomials: for fixed order 
$N$, the $i$th Krawtchouk polynomial takes its corresponding values from the 
$i$th row: 

$$k_i(j, N) = K^{(N)}_{ij} \qquad (1.2)$$

One can easily show that $k_i(j, N)$ can be given as a polynomial of degree $i$ 
in the variable $j$. For fixed $N$, one has a system of $N + 1$ polynomials 
orthogonal with respect to the symmetric binomial distribution. 

A fundamental fact is that the square of a Krawtchouk matrix is proportional 
to the identity matrix: 

$$(K^{(N)})^2 = 2^N \cdot I$$

This property allows one to define a Fourier-like Krawtchouk transform on 
integer vectors. For more properties we refer the reader to [Feinsilver, 2001]. 
In the present article, we focus on Krawtchouk matrices as they arise from 
corresponding Sylvester-Hadamard matrices. More structure is revealed through 
consideration of symmetric tensor algebra. 
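
As a small check of the definition, the following Python sketch builds the 
Krawtchouk matrix from the coefficient of $v^i$ in 
$(1+v)^{N-j}(1-v)^j$, written as an alternating sum of products of binomial 
coefficients, and verifies that its square is $2^N$ times the identity. The 
function names are illustrative; `math.comb` requires Python 3.8 or later. 

```python
from math import comb

def krawtchouk_matrix(N):
    """Entry (i, j) is the coefficient of v^i in (1+v)^(N-j) * (1-v)^j."""
    return [[sum((-1) ** k * comb(j, k) * comb(N - j, i - k)
                 for k in range(i + 1))
             for j in range(N + 1)]
            for i in range(N + 1)]

def matmul(A, B):
    """Plain integer matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

if __name__ == "__main__":
    K = krawtchouk_matrix(3)
    for row in matmul(K, K):
        print(row)      # expected: 8 on the diagonal, 0 elsewhere
```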

Symmetric Krawtchouk matrices. When each column of a Krawtchouk 
matrix is multiplied by the corresponding binomial coefficient, the matrix 
becomes symmetric. In other words, define the symmetric Krawtchouk matrix 
as 

$$S^{(N)} = K^{(N)} B^{(N)}$$

where $B^{(N)}$ denotes the $(N + 1) \times (N + 1)$ diagonal matrix with 
binomial coefficients, $B^{(N)}_{ii} = \binom{N}{i}$, as its non-zero entries. 

$$S^{(3)} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 3 & 1 & -1 & -3 \\ 3 & -1 & -1 & 3 \\ 1 & -1 & 1 & -1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 3 & 3 & 1 \\ 3 & 3 & -3 & -3 \\ 3 & -3 & -3 & 3 \\ 1 & -3 & 3 & -1 \end{bmatrix}$$

Some symmetric Krawtchouk matrices are displayed in Table 2 of the 
Appendix. A study of the spectral properties of the symmetric Krawtchouk 
matrices was initiated in work with Fitzgerald [Feinsilver & Fitzgerald, 1996]. 
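
The column scaling that produces the symmetric Krawtchouk matrix is easy to 
check numerically. The sketch below, with illustrative function names, 
multiplies column $j$ of the Krawtchouk matrix by the binomial coefficient 
$\binom{N}{j}$ and tests the result for symmetry. 

```python
from math import comb

def krawtchouk_matrix(N):
    """Entry (i, j) is the coefficient of v^i in (1+v)^(N-j) * (1-v)^j."""
    return [[sum((-1) ** k * comb(j, k) * comb(N - j, i - k)
                 for k in range(i + 1))
             for j in range(N + 1)]
            for i in range(N + 1)]

def symmetric_krawtchouk(N):
    """Scale column j of the Krawtchouk matrix by binomial(N, j)."""
    K = krawtchouk_matrix(N)
    return [[K[i][j] * comb(N, j) for j in range(N + 1)]
            for i in range(N + 1)]

def is_symmetric(M):
    return all(M[i][j] == M[j][i]
               for i in range(len(M)) for j in range(len(M)))

if __name__ == "__main__":
    for row in symmetric_krawtchouk(3):
        print(row)
```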

Background note. Krawtchouk polynomials were introduced by Mikhail 
Krawtchouk in the late 1920s [Krawtchouk, 1929; Krawtchouk, 1933]. The idea 
of setting them in a matrix form appeared in the 1985 work of N. Bose 
[Bose, 1985] on digital filtering in the context of the Cayley transform on 
the complex plane. For some further development of this idea, see 
[Feinsilver, 2001]. 

The Krawtchouk polynomials play an important role in many areas of 
mathematics. Here are some examples: 

■ Harmonic analysis. As orthogonal polynomials, they appear in the 
classic work by Szego [Sze, 1959]. They have been studied from the 
point of view of harmonic analysis and special functions, e.g., in work 
of Dunkl [Dunkl, 1976; Dunkl, 1974]. Krawtchouk polynomials may be 
viewed as the discrete version of Hermite polynomials (see, e.g., 
[Atakishiyev, 1997]). 

■ Statistics. Among the statistics literature we note particularly Eagleson 
[Eagelson, 1969] and Vere-Jones [Vere-Jones, 1971]. 

■ Combinatorics and coding theory. Krawtchouk polynomials are 
essential in MacWilliams' theorem on weight enumerators [Levenstein, 
1995; Mac Williams & Sloane, 1977], and are a fundamental example in 
association schemes [Delsarte, 1972; Delsarte, 1973; Delsarte, 1973a]. 

■ Probability theory. In the context of the classical symmetric random 
walk, it is recognized that Krawtchouk's polynomials are elementary 
symmetric functions in variables taking values ±1. It turns out that the 
generating function (1.1) is a martingale in the parameter N [Feinsilver 
& Schott, 1991]. 

■ Quantum theory. Krawtchouk matrices interpreted as operators give 
rise to two new interpretations in the context of both classical and 
quantum random walks [Feinsilver, 2001]. The significance of the latter 
interpretation lies at the basis of quantum computing. 

Let us proceed to show the relationship between Krawtchouk matrices and 
Sylvester-Hadamard matrices. 



5.2 Krawtchouk matrices from Hadamard matrices 

Taking the Kronecker (tensor) product of the initial matrix 

$$H = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$$

with itself N times defines the family of Sylvester-Hadamard matrices. 
(For a review of Hadamard matrices, see Yarlagadda and Hershey [Rao & 
Hershey, 1997].) 

NOTATION 2 Denote the Sylvester-Hadamard matrices, tensor (Kronecker) 
powers of the fundamental matrix H, by 

$$H^{(N)} = H^{\otimes N} = \underbrace{H \otimes H \otimes \cdots \otimes H}_{N \text{ times}}$$

The first three Sylvester-Hadamard matrices $H^{(1)}$, $H^{(2)}$ and $H^{(3)}$ 
are given by: 

    • •        • • • •        • • • • • • • • 
    • o        • o • o        • o • o • o • o 
               • • o o        • • o o • • o o 
               • o o •        • o o • • o o • 
                              • • • • o o o o 
                              • o • o o • o • 
                              • • o o o o • • 
                              • o o • o • • o 

where, to emphasize the patterns, we use • for 1 and o for -1. See Table 3 of 
the Appendix for these matrices up to order 5. 

For N = 1, the Hadamard matrix coincides with the Krawtchouk matrix: 
$H^{(1)} = K^{(1)}$. Now we wish to see how the two classes of matrices are 
related for higher N. It turns out that appropriately contracting (condensing) 
Hadamard-Sylvester matrices yields corresponding symmetric Krawtchouk 
matrices. 

The problem is that the tensor products disperse the columns and rows that 
have to be summed up to do the contraction. We need to identify the right sets 
of indices. 



DEFINITION 3 Define the binary shuffling function as the function

$$w\colon \mathbb{N} \to \mathbb{N}$$

giving the "binary weight" of an integer. That is, let $n = \sum_k d_k 2^k$ be the binary expansion of the number $n$. Then $w(n) = \sum_k d_k$, the number of ones in the representation.

Notice that, as sets,

$$w(\{0, 1, \ldots, 2^N - 1\}) = \{0, 1, \ldots, N\}$$

Here are the first 16 values of $w$, listed for the integers running from 0 through $2^4 - 1 = 15$:

0112122312232334

The shuffling function can be defined recursively. Set $w(0) = 0$ and

$$w(2^N + k) = w(k) + 1 \qquad (2.1)$$

for $0 \le k < 2^N$. One can thus create the sequence of values of the shuffling function by starting with 0 and then appending to the current string of values a copy of itself with values increased by 1:

0 → 01 → 0112 → 01121223 → ...
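The doubling rule (2.1) is easy to check by machine. The following is a minimal sketch, not part of the original text; the function names `binary_weight` and `weights_by_doubling` are our own illustrative choices.

```python
def binary_weight(n: int) -> int:
    """w(n): the number of ones in the binary expansion of n."""
    return bin(n).count("1")


def weights_by_doubling(N: int) -> list:
    """Values w(0), ..., w(2**N - 1), built with the recursion (2.1):
    append to the current list a copy of itself with every value raised by 1."""
    values = [0]
    for _ in range(N):
        values = values + [v + 1 for v in values]
    return values


if __name__ == "__main__":
    # The string of the first 16 values quoted in the text.
    print("".join(str(v) for v in weights_by_doubling(4)))
```

Both constructions agree: the list produced by doubling coincides with the bit counts computed directly.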

Now we can state the result; 



THEOREM 4 Symmetric Krawtchouk matrices are reductions of Hadamard matrices as follows:

$$S^{(N)}_{ij} = \sum_{\substack{w(a) = i \\ w(b) = j}} H^{(N)}_{ab}$$



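Example. Let us see the transformation $H^{(4)} \to S^{(4)}$ (recall that • stands for 1, and o for -1). Applying the binary shuffling function to $H^{(4)}$, mark the rows and columns accordingly:

        0 1 1 2 1 2 2 3 1 2 2 3 2 3 3 4

    0   • • • • • • • • • • • • • • • •
    1   • o • o • o • o • o • o • o • o
    1   • • o o • • o o • • o o • • o o
    2   • o o • • o o • • o o • • o o •
    1   • • • • o o o o • • • • o o o o
    2   • o • o o • o • • o • o o • o •
    2   • • o o o o • • • • o o o o • •
    3   • o o • o • • o • o o • o • • o
    1   • • • • • • • • o o o o o o o o
    2   • o • o • o • o o • o • o • o •
    2   • • o o • • o o o o • • o o • •
    3   • o o • • o o • o • • o o • • o
    2   • • • • o o o o o o o o • • • •
    3   • o • o o • o • o • o • • o • o
    3   • • o o o o • • o o • • • • o o
    4   • o o • o • • o o • • o • o o •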



The contraction is performed by summing columns with the same index, then summing rows in similar fashion. One checks from the given matrix that indeed this procedure gives the symmetric Krawtchouk matrix $S^{(4)}$, with rows and columns labelled by the weights 0, 1, 2, 3, 4:

$$S^{(4)} = \begin{pmatrix}
1 & 4 & 6 & 4 & 1 \\
4 & 8 & 0 & -8 & -4 \\
6 & 0 & -12 & 0 & 6 \\
4 & -8 & 0 & 8 & -4 \\
1 & -4 & 6 & -4 & 1
\end{pmatrix}$$
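The contraction by binary weight can be carried out mechanically. Here is a small sketch, assuming only the definitions above (the tensor power of $H$ and the binary weight $w$); the helper names `kron`, `sylvester_hadamard` and `contract_by_weight` are hypothetical and chosen for illustration.

```python
def kron(A, B):
    """Kronecker (tensor) product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def sylvester_hadamard(N):
    """N-fold tensor power of the fundamental matrix H = [[1, 1], [1, -1]]."""
    H = [[1, 1], [1, -1]]
    out = [[1]]
    for _ in range(N):
        out = kron(out, H)
    return out


def binary_weight(n):
    """Number of ones in the binary expansion of n."""
    return bin(n).count("1")


def contract_by_weight(M, N):
    """Sum the entries M[a][b] into the cell (w(a), w(b))."""
    S = [[0] * (N + 1) for _ in range(N + 1)]
    for a, row in enumerate(M):
        for b, value in enumerate(row):
            S[binary_weight(a)][binary_weight(b)] += value
    return S


S4 = contract_by_weight(sylvester_hadamard(4), 4)

if __name__ == "__main__":
    for row in S4:
        print(row)
```

The 16 x 16 matrix collapses to a 5 x 5 matrix because only five distinct weights occur among the indices 0, ..., 15.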



Now we give a method for transforming the $N$-th (symmetric) Krawtchouk matrix into the $(N+1)$-st.

DEFINITION 5 The square contraction $r(M)$ of a $2n \times 2n$ matrix $M_{ab}$, $1 \le a, b \le 2n$, is the $(n + 1) \times (n + 1)$ matrix with entries

$$(rM)_{ij} = \sum_{\substack{a = 2i,\ 2i+1 \\ b = 2j,\ 2j+1}} M_{ab}$$

$0 \le i, j \le n$, where the values of $M_{ab}$ with $a$ or $b$ outside of the range $(1, \ldots, 2n)$ are taken as zero.

THEOREM 6 Symmetric Krawtchouk matrices satisfy:

$$S^{(N+1)} = r\bigl(S^{(N)} \otimes H\bigr)$$

with $S^{(1)} = H$.

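Example. Start with the symmetric Krawtchouk matrix of order 2:

$$S^{(2)} = \begin{pmatrix} 1 & 2 & 1 \\ 2 & 0 & -2 \\ 1 & -2 & 1 \end{pmatrix}$$

Take the tensor product with $H$:

$$S^{(2)} \otimes H = \begin{pmatrix}
1 & 1 & 2 & 2 & 1 & 1 \\
1 & -1 & 2 & -2 & 1 & -1 \\
2 & 2 & 0 & 0 & -2 & -2 \\
2 & -2 & 0 & 0 & -2 & 2 \\
1 & 1 & -2 & -2 & 1 & 1 \\
1 & -1 & -2 & 2 & 1 & -1
\end{pmatrix}$$

surround with zeros and contract:

$$r\bigl(S^{(2)} \otimes H\bigr) = r\begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 2 & 2 & 1 & 1 & 0 \\
0 & 1 & -1 & 2 & -2 & 1 & -1 & 0 \\
0 & 2 & 2 & 0 & 0 & -2 & -2 & 0 \\
0 & 2 & -2 & 0 & 0 & -2 & 2 & 0 \\
0 & 1 & 1 & -2 & -2 & 1 & 1 & 0 \\
0 & 1 & -1 & -2 & 2 & 1 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix} = \begin{pmatrix}
1 & 3 & 3 & 1 \\
3 & 3 & -3 & -3 \\
3 & -3 & -3 & 3 \\
1 & -3 & 3 & -1
\end{pmatrix} = S^{(3)}$$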
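The recursion of Theorem 6 can be traced step by step. The sketch below is illustrative only: it assumes the square contraction $r$ sums each 2 x 2 block of the zero-padded matrix, as in Definition 5, and the names `square_contraction` and `symmetric_krawtchouk` are our own.

```python
def kron(A, B):
    """Kronecker (tensor) product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def square_contraction(M):
    """r(M) for a 2n x 2n matrix M with 1-based indices a, b in 1..2n.
    Entry (i, j) sums M[a][b] over a in {2i, 2i+1}, b in {2j, 2j+1};
    indices outside 1..2n contribute zero."""
    n = len(M) // 2

    def entry(a, b):
        inside = 1 <= a <= 2 * n and 1 <= b <= 2 * n
        return M[a - 1][b - 1] if inside else 0

    return [[sum(entry(a, b) for a in (2 * i, 2 * i + 1) for b in (2 * j, 2 * j + 1))
             for j in range(n + 1)]
            for i in range(n + 1)]


H = [[1, 1], [1, -1]]


def symmetric_krawtchouk(N):
    """S^(N) for N >= 1, obtained from S^(1) = H by repeated tensoring and contraction."""
    S = H
    for _ in range(N - 1):
        S = square_contraction(kron(S, H))
    return S


if __name__ == "__main__":
    for row in symmetric_krawtchouk(3):
        print(row)
```

Each step doubles the size by tensoring with $H$ and then shrinks it again, so $S^{(N)}$ stays of size $(N+1) \times (N+1)$.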



COROLLARY 7 Krawtchouk matrices satisfy:

$$K^{(N+1)} = r\bigl(K^{(N)} B^{(N)} \otimes H\bigr)\,\bigl(B^{(N+1)}\bigr)^{-1}$$

where $B$ is the diagonal binomial matrix.

Note that starting with the $2 \times 2$ identity matrix, $I$, set $I^{(1)} = I$, $I^{(N+1)} = r\bigl(I^{(N)} \otimes I\bigr)$. Then, in fact, $I^{(N)} = B^{(N)}$.

Next, we present the algebraic structure underlying these remarkable properties.

5.3 Krawtchouk matrices and symmetric tensors

Given a $d$-dimensional vector space $V$ over $\mathbb{R}$, one may construct a $d^N$-dimensional space $V^{\otimes N}$, the $N$-fold tensor product of $V$, and, as well, a $\binom{N+d-1}{N}$-dimensional symmetric tensor space $V^{\otimes_s N}$. There is a natural map

$$\mathrm{symm}\colon V^{\otimes N} \to V^{\otimes_s N}$$

which, for homogeneous tensors, is defined via

$$\mathrm{symm}(v \otimes w \otimes \cdots) = \text{symmetrization of } (v \otimes w \otimes \cdots)$$

For computational purposes, it is convenient to use the fact that the symmet- 
ric tensor space of order N of a d-dimensional vector space is isomorphic to 
the space of polynomials in d variables homogeneous of degree N. 

Let $\{e_1, e_2, \ldots, e_d\}$ be a basis of $V$. Map $e_i$ to $x_i$, replace tensor products by multiplication of the variables, and extend by linearity. For example,

$$2e_1 \otimes e_2 + 3e_2 \otimes e_1 - 7e_3 \otimes e_2 \;\to\; 5x_1x_2 - 7x_2x_3$$

thus identifying basis (elementary) tensors in $V^{\otimes N}$ that are equivalent under any permutation.

This map induces a map on certain linear operators. Suppose $A \in \mathrm{End}(V)$ is a linear transformation on $V$. This induces a linear transformation $A_N = A^{\otimes N} \in \mathrm{End}(V^{\otimes N})$ defined on elementary tensors by:

$$A_N(v \otimes w \otimes \cdots) = A(v) \otimes A(w) \otimes \cdots$$

Similarly, a linear operator on the symmetric tensor spaces is induced so that the following diagram commutes:

$$\begin{array}{ccc}
V^{\otimes N} & \xrightarrow{\;A_N\;} & V^{\otimes N} \\
\downarrow{\scriptstyle \mathrm{symm}} & & \downarrow{\scriptstyle \mathrm{symm}} \\
V^{\otimes_s N} & \xrightarrow{\;\bar A_N\;} & V^{\otimes_s N}
\end{array}$$

This can be understood by examining the action on polynomials. We call $\bar A_N$ the symmetric representation of $A$ in degree $N$. Denote the matrix elements of $\bar A_N$ by $\bar A_{mn}$. If $A$ has matrix entries $A_{ij}$, let

$$y_i = \sum_j A_{ij} x_j$$

It is convenient to label variables with indices from 0 to $\delta = d - 1$. Then the matrix elements of the symmetric representation are defined by the expansion:

$$y_0^{m_0} \cdots y_\delta^{m_\delta} = \sum_{n} \bar A_{mn}\, x_0^{n_0} \cdots x_\delta^{n_\delta}$$

with multi-indices $m$ and $n$ homogeneous of degree $N$.

Mapping to the symmetric representation is an algebra homomorphism, i.e.,

$$\overline{AB} = \bar A\, \bar B$$

Explicitly, in matrix notation, $(\overline{AB})_{mn} = \sum_r \bar A_{mr} \bar B_{rn}$.

Now we are ready to state our result.

PROPOSITION 4 For each $N > 0$, the symmetric representation of the $N$-th Sylvester-Hadamard matrix equals the transposed $N$-th Krawtchouk matrix:

$$(\bar H_N)_{ij} = K^{(N)}_{ji}$$

Proof. Writing $(x, y)$ for $(x_0, x_1)$, we have in degree $N$ for the $k$-th component:

$$(x + y)^{N-k}(x - y)^k = \sum_l \bar H_{kl}\, x^{N-l} y^l$$

Substituting $x = 1$ yields the generating function (1.1) for the Krawtchouk matrices with the coefficient of $y^l$ equal to $K^{(N)}_{lk}$. Thus the result. ■
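For $d = 2$ the symmetric representation can be computed by expanding polynomials. The following sketch is ours, not the authors': it sets $x = 1$ and stores a homogeneous polynomial by its coefficients in $y$, and it assumes that the generating function (1.1) gives $K^{(N)}_{ij}$ as the coefficient of $v^i$ in $(1+v)^{N-j}(1-v)^j$. All function names are hypothetical.

```python
from math import comb


def poly_mul(p, q):
    """Product of two polynomials given as coefficient lists (lowest degree first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def poly_pow(p, n):
    """n-th power of a polynomial, n >= 0."""
    out = [1]
    for _ in range(n):
        out = poly_mul(out, p)
    return out


def symmetric_representation(A, N):
    """Row k holds the coefficients of x^(N-l) y^l in (a x + b y)^(N-k) (c x + d y)^k."""
    (a, b), (c, d) = A
    return [poly_mul(poly_pow([a, b], N - k), poly_pow([c, d], k)) for k in range(N + 1)]


def krawtchouk(N):
    """Assumed form of (1.1): K[i][j] is the coefficient of v^i in (1+v)^(N-j) (1-v)^j."""
    return [[sum((-1) ** k * comb(j, k) * comb(N - j, i - k) for k in range(min(i, j) + 1))
             for j in range(N + 1)]
            for i in range(N + 1)]


def matmul(A, B):
    """Matrix product for lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]


H = [[1, 1], [1, -1]]

if __name__ == "__main__":
    for row in symmetric_representation(H, 4):
        print(row)
```

The same routine also lets one test the homomorphism property $\overline{AB} = \bar A \bar B$ on arbitrary integer matrices.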



Insight into these correspondences can be gained by splitting the fundamental Hadamard matrix $H$ ($= K^{(1)}$) into two special symmetric $2 \times 2$ operators:

$$F = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad G = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$$

so that

$$H = F + G = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$$

One can readily check that

$$F^2 = G^2 = I, \qquad FH = HG \quad\text{and}\quad GH = HF \qquad (3.1)$$

The first of the second pair of equations may be viewed as the spectral decomposition of $F$, and we can interpret the Hadamard matrix as diagonalizing $F$ into $G$. Taking transposes gives the second equation of (3.1).

Now we proceed to the interpretation leading to a symmetric Bernoulli quantum random walk ([Feinsilver, 2001]). For this interpretation, the Hilbert space of states is represented by the $N$-th tensor power of the original 2-dimensional space $V$, that is, by the $2^N$-dimensional Hilbert space $V^{\otimes N}$. Define the following linear operator on $V^{\otimes N}$:

$$\begin{aligned}
X_F &= F \otimes I \otimes \cdots \otimes I \\
&\quad + I \otimes F \otimes I \otimes \cdots \otimes I \\
&\quad + \cdots \\
&\quad + I \otimes I \otimes \cdots \otimes F \\
&= f_1 + f_2 + \cdots + f_i + \cdots + f_N
\end{aligned}$$

each term describing a "flip" at the $i$-th position (cf. [Hess, 1954; Siegert, 1949]). Analogously, we define:

$$\begin{aligned}
X_G &= G \otimes I \otimes \cdots \otimes I \\
&\quad + I \otimes G \otimes I \otimes \cdots \otimes I \\
&\quad + \cdots \\
&\quad + I \otimes I \otimes \cdots \otimes G \\
&= g_1 + g_2 + \cdots + g_i + \cdots + g_N
\end{aligned}$$

From equations (3.1) we see that our $X$-operators intertwine the Sylvester-Hadamard matrices:

$$X_F H^{(N)} = H^{(N)} X_G \quad\text{and}\quad X_G H^{(N)} = H^{(N)} X_F$$

Since products are preserved in the process of passing to the symmetric tensor space, we get

$$\bar X_F \bar H_N = \bar H_N \bar X_G \quad\text{and}\quad \bar X_G \bar H_N = \bar H_N \bar X_F \qquad (3.2)$$

the bars indicating the corresponding induced maps.
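The intertwining relations on the full tensor space can be confirmed numerically for small $N$. This is a minimal sketch with plain Python lists; the helper names (`kron`, `matmul`, `tensor_power`, `flip_sum`) are illustrative assumptions, not notation from the text.

```python
def kron(A, B):
    """Kronecker (tensor) product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def matmul(A, B):
    """Matrix product for lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]


I2 = [[1, 0], [0, 1]]
F = [[0, 1], [1, 0]]
G = [[1, 0], [0, -1]]
H = [[1, 1], [1, -1]]


def tensor_power(A, N):
    """A tensored with itself N times."""
    out = [[1]]
    for _ in range(N):
        out = kron(out, A)
    return out


def flip_sum(A, N):
    """X_A: the sum over positions i of I x ... x A x ... x I, with A in slot i."""
    size = 2 ** N
    total = [[0] * size for _ in range(size)]
    for i in range(N):
        term = [[1]]
        for pos in range(N):
            term = kron(term, A if pos == i else I2)
        total = [[t + s for t, s in zip(row_t, row_s)] for row_t, row_s in zip(total, term)]
    return total


N = 3
X_F, X_G, H_N = flip_sum(F, N), flip_sum(G, N), tensor_power(H, N)

if __name__ == "__main__":
    print(matmul(X_F, H_N) == matmul(H_N, X_G))
    print(matmul(X_G, H_N) == matmul(H_N, X_F))
```

Since $X_G$ is diagonal, the first relation says that the columns of $H^{(N)}$ are eigenvectors of $X_F$.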

We have seen in Proposition 4 how to calculate $\bar H_N$ from the action of $H$ on polynomials in degree $N$. For symmetric tensors we have the components in degree $N$, namely $x^{N-k}y^k$, for $0 \le k \le N$, where for convenience we write $x$ for $x_0$ and $y$ for $x_1$. Now consider the generating function for the elementary symmetric functions in the quantum variables $f_j$. This is the $N$-fold tensor power

$$\mathcal{F}_N(t) = (I + tF)^{\otimes N} = I^{\otimes N} + tX_F + \cdots$$

noting that the coefficient of $t$ is $X_F$. Similarly, define

$$\mathcal{G}_N(t) = (I + tG)^{\otimes N} = I^{\otimes N} + tX_G + \cdots$$

From $(I + tF)H = H(I + tG)$ we have

$$\mathcal{F}_N H^{(N)} = H^{(N)} \mathcal{G}_N \quad\text{and}\quad \bar{\mathcal{F}}_N \bar H_N = \bar H_N \bar{\mathcal{G}}_N$$

The difficulty is to calculate the action on the symmetric tensors for operators, such as $X_F$, that are not pure tensor powers. However, from $\mathcal{F}_N(t)$ and $\mathcal{G}_N(t)$ we can recover $X_F$ and $X_G$ via

$$X_F = \frac{d}{dt}\bigg|_{t=0} (I + tF)^{\otimes N}, \qquad X_G = \frac{d}{dt}\bigg|_{t=0} (I + tG)^{\otimes N}$$

with corresponding relations for the barred operators. Calculating on polynomials yields the desired results as follows.



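$$I + tF = \begin{pmatrix} 1 & t \\ t & 1 \end{pmatrix}, \qquad I + tG = \begin{pmatrix} 1 + t & 0 \\ 0 & 1 - t \end{pmatrix}$$

In degree $N$, using $x$ and $y$ as variables, we get the $k$-th component for $\bar X_F$ and $\bar X_G$ via

$$\frac{d}{dt}\bigg|_{t=0} (x + ty)^{N-k}(tx + y)^k = (N - k)\,x^{N-(k+1)}y^{k+1} + k\,x^{N-(k-1)}y^{k-1}$$

and since $I + tG$ is diagonal,

$$\frac{d}{dt}\bigg|_{t=0} (1 + t)^{N-k}(1 - t)^k\, x^{N-k}y^k = (N - 2k)\,x^{N-k}y^k$$

For example, calculations for $N = 4$ result in

$$\bar X_F = \begin{pmatrix}
0 & 4 & 0 & 0 & 0 \\
1 & 0 & 3 & 0 & 0 \\
0 & 2 & 0 & 2 & 0 \\
0 & 0 & 3 & 0 & 1 \\
0 & 0 & 0 & 4 & 0
\end{pmatrix} \qquad (3.3)$$

$$\bar X_G = \begin{pmatrix}
4 & 0 & 0 & 0 & 0 \\
0 & 2 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -2 & 0 \\
0 & 0 & 0 & 0 & -4
\end{pmatrix} \qquad (3.4)$$

$$\bar H_4 = \begin{pmatrix}
1 & 4 & 6 & 4 & 1 \\
1 & 2 & 0 & -2 & -1 \\
1 & 0 & -2 & 0 & 1 \\
1 & -2 & 0 & 2 & -1 \\
1 & -4 & 6 & -4 & 1
\end{pmatrix} \qquad (3.5)$$

Since $\bar X_G$ is the result of diagonalizing $\bar X_F$, we observe that

COROLLARY 8 The spectrum of $\bar X_F$ is $N, N - 2, \ldots, 2 - N, -N$, coinciding with the support of the classical random walk.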
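The statement about the spectrum can be checked without an eigenvalue solver. The sketch below is an illustration under stated assumptions: it builds the symmetric-space operators from the derivative formulas (row $k$ holds the $k$-th component), takes row $k$ of $\bar H_N$ to be the coefficients of $(1+y)^{N-k}(1-y)^k$, and verifies $\bar X_F \bar H_N = \bar H_N \bar X_G$, so the columns of $\bar H_N$ are eigenvectors of $\bar X_F$ with eigenvalues $N - 2k$. The function names are hypothetical.

```python
from math import comb


def x_f_bar(N):
    """Row k has N - k in column k + 1 and k in column k - 1."""
    M = [[0] * (N + 1) for _ in range(N + 1)]
    for k in range(N + 1):
        if k + 1 <= N:
            M[k][k + 1] = N - k
        if k - 1 >= 0:
            M[k][k - 1] = k
    return M


def x_g_bar(N):
    """Diagonal matrix with entries N - 2k."""
    return [[N - 2 * k if k == l else 0 for l in range(N + 1)] for k in range(N + 1)]


def h_bar(N):
    """Row k: coefficients of y^l in (1 + y)^(N-k) (1 - y)^k."""
    return [[sum((-1) ** m * comb(k, m) * comb(N - k, l - m) for m in range(min(k, l) + 1))
             for l in range(N + 1)]
            for k in range(N + 1)]


def matmul(A, B):
    """Matrix product for lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]


if __name__ == "__main__":
    N = 4
    print(matmul(x_f_bar(N), h_bar(N)) == matmul(h_bar(N), x_g_bar(N)))
    print([x_g_bar(N)[k][k] for k in range(N + 1)])
```

The diagonal of `x_g_bar(N)` lists exactly the positions $N, N-2, \ldots, -N$ that a walk with $\pm 1$ steps can reach after $N$ steps.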



Remark on the shuffling map. Notice that the top row of $(I + tF)^{\otimes N}$ is exactly $t^{w(k)}$, where $w(k)$ is the binary shuffling function of section §5.2. Each time one tensors with $I + tF$, the original top row is reproduced, then concatenated with a replica of itself modified in that each entry picks up a factor of $t$ (compare with equation (2.1)). And, collapsing to the symmetric tensor space, the top row will have entries $\binom{N}{k}t^k$. This follows as well by direct calculation of the component matrix elements in degree $N$, namely by expanding $(x + ty)^N$.

We continue with some areas where Krawtchouk polynomials/matrices play 
a role, very often not explicitly recognized in the original contexts. 



5.4 Ehrenfest urn model

In order to explain how the apparent irreversibility of the second law of thermodynamics arises from reversible statistical physics, the Ehrenfests introduced a so-called urn model, variations of which have been considered by many authors [Kac, 1947; Karlin & McGregor, 1965; Voit, 1996].

We have an urn with $N$ balls. Each ball can be in two states represented by, say, being lead or gold. At each time $k \in \mathbb{N}$, a ball is drawn at random, changed by a Midas-like touch into the opposite state (gold ↔ lead) and placed back in the urn. The question is of course about the distribution of states, and this leads to Krawtchouk matrices.



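Represent the states of the model by vectors in $\mathbb{R}^{N+1}$, namely the state of $k$ gold balls by

$$v_k = [\,0 \;\cdots\; 1 \;\cdots\; 0\,]^T \qquad (1 \text{ in the } k\text{-th position})$$

In the case of, say, $N = 3$, we have 4 states

$$\underbrace{\begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}}_{\substack{0 \text{ gold balls} \\ 3 \text{ lead balls}}} \qquad
\underbrace{\begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \end{pmatrix}}_{\substack{1 \text{ gold ball} \\ 2 \text{ lead balls}}} \qquad
\underbrace{\begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \end{pmatrix}}_{\substack{2 \text{ gold balls} \\ 1 \text{ lead ball}}} \qquad
\underbrace{\begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}}_{\substack{3 \text{ gold balls} \\ 0 \text{ lead balls}}} \qquad (4.1)$$

It is easy to see that the matrix of elementary state change in this case is

$$\begin{pmatrix}
0 & \tfrac13 & 0 & 0 \\
1 & 0 & \tfrac23 & 0 \\
0 & \tfrac23 & 0 & 1 \\
0 & 0 & \tfrac13 & 0
\end{pmatrix} = \frac13 \begin{pmatrix}
0 & 1 & 0 & 0 \\
3 & 0 & 2 & 0 \\
0 & 2 & 0 & 3 \\
0 & 0 & 1 & 0
\end{pmatrix} = \frac13 A^{(3)}$$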



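and in general, we have the Kac matrix with off-diagonals in arithmetic progression $1, 2, 3, \ldots$ descending and ascending, respectively:

$$A^{(N)} = \begin{pmatrix}
0 & 1 & 0 & 0 & \cdots & 0 \\
N & 0 & 2 & 0 & \cdots & 0 \\
0 & N-1 & 0 & 3 & \cdots & 0 \\
0 & 0 & N-2 & 0 & \ddots & \vdots \\
\vdots & \vdots & & \ddots & \ddots & N \\
0 & 0 & 0 & \cdots & 1 & 0
\end{pmatrix}$$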






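It turns out that the spectral properties of the Kac matrix involve Krawtchouk matrices, namely, the collective solution to the eigenvalue problem $Av = \lambda v$ is

$$A^{(N)} K^{(N)} = K^{(N)} \Lambda^{(N)}$$

where $\Lambda^{(N)}$ is the $(N+1) \times (N+1)$ diagonal matrix with entries $\Lambda^{(N)}_{ii} = N - 2i$:

$$\Lambda^{(N)} = \begin{pmatrix}
N & & & & & \\
& N-2 & & & (*) & \\
& & N-4 & & & \\
& & & \ddots & & \\
& (*) & & & 2-N & \\
& & & & & -N
\end{pmatrix}$$

the $(*)$'s denoting blocks of zeros.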



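To illustrate, for $N = 3$ we have

$$\begin{pmatrix}
0 & 1 & 0 & 0 \\
3 & 0 & 2 & 0 \\
0 & 2 & 0 & 3 \\
0 & 0 & 1 & 0
\end{pmatrix}
\begin{pmatrix}
1 & 1 & 1 & 1 \\
3 & 1 & -1 & -3 \\
3 & -1 & -1 & 3 \\
1 & -1 & 1 & -1
\end{pmatrix} =
\begin{pmatrix}
1 & 1 & 1 & 1 \\
3 & 1 & -1 & -3 \\
3 & -1 & -1 & 3 \\
1 & -1 & 1 & -1
\end{pmatrix}
\begin{pmatrix}
3 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & -1 & 0 \\
0 & 0 & 0 & -3
\end{pmatrix}$$

To see this in general, we note that, cf. equations (3.3-3.5), these are the same operators appearing in the quantum random walk model, namely, we discover that $\Lambda^{(N)} = \bar X_G$, $A^{(N)} = \bar X_F^{\,T}$. Now, recalling $K^{(N)} = \bar H_N^{\,T}$, taking transposes in equation (3.2) yields

$$A^{(N)} K^{(N)} = K^{(N)} \Lambda^{(N)} \quad\text{and}\quad K^{(N)} A^{(N)} = \Lambda^{(N)} K^{(N)}$$

which is the spectral analysis of $A^{(N)}$ from both the left and the right. Thus, e.g., the columns of the Krawtchouk matrix are eigenvectors of the Ehrenfest model with $N$ balls, where the $k$-th column $v_k := (K^{(N)}_{\cdot k})$ has corresponding eigenvalue $\lambda_k = (N - 2k)/N$.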
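The eigenvalue relation for the Kac matrix is easy to confirm for small $N$. The following sketch is ours: it builds $A^{(N)}$ from the off-diagonal pattern described above and assumes that (1.1) gives $K^{(N)}_{ij}$ as the coefficient of $v^i$ in $(1+v)^{N-j}(1-v)^j$. The names `kac_matrix` and `krawtchouk` are illustrative.

```python
from math import comb


def kac_matrix(N):
    """A^(N): superdiagonal 1, 2, ..., N and subdiagonal N, N-1, ..., 1."""
    A = [[0] * (N + 1) for _ in range(N + 1)]
    for i in range(N):
        A[i][i + 1] = i + 1
        A[i + 1][i] = N - i
    return A


def krawtchouk(N):
    """Assumed form of (1.1): K[i][j] is the coefficient of v^i in (1+v)^(N-j) (1-v)^j."""
    return [[sum((-1) ** k * comb(j, k) * comb(N - j, i - k) for k in range(min(i, j) + 1))
             for j in range(N + 1)]
            for i in range(N + 1)]


def matmul(A, B):
    """Matrix product for lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]


def check_spectral_relations(N):
    """True if A K = K Lambda and K A = Lambda K with Lambda = diag(N - 2k)."""
    A, K = kac_matrix(N), krawtchouk(N)
    lam = [[N - 2 * k if k == l else 0 for l in range(N + 1)] for k in range(N + 1)]
    return matmul(A, K) == matmul(K, lam) and matmul(K, A) == matmul(lam, K)


if __name__ == "__main__":
    print(all(check_spectral_relations(N) for N in range(1, 8)))
```

Dividing $A^{(N)}$ by $N$ gives the transition matrix of the urn, whose columns sum to 1; this is why the eigenvalues of the model are $(N - 2k)/N$.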
Remarks

1 Clearly, the Ehrenfest urn problem can be expressed in other terms. For instance, it can be reformulated as a random walk on an $N$-dimensional cube. Suppose an ant walks on the cube, choosing at random an edge to progress to the next vertex. Represent the states by vectors in $\mathbb{Z}_2^N = \mathbb{Z}_2 \times \cdots \times \mathbb{Z}_2$, $N$ factors. The equivalence of the two problems comes via the correspondence of states

$$\mathbb{Z}_2^N \ni [a_1 a_2 \ldots a_N] \;\mapsto\; v_w \in \mathbb{R}^{N+1}$$

where $w = \sum_i a_i$ is the weight of the vector calculated in $\mathbb{N}$; see (4.1).

2 The urn model in the appropriate limit as $N \to \infty$ leads to a diffusion model on the line, the discrete distributions converging to the diffusion densities. See Kac's article ([Kac, 1947]).

3 There is a rather unexpected connection of the urn model with finite-dimensional representations of the Lie algebra $sl(2) \cong so(2,1)$. Indeed, introduce a new matrix by the commutator:

$$\bar A = \tfrac12 [A, \Lambda]$$

The matrix $\bar A$ is a skew-symmetric version of $A$. For $N = 3$, it is

$$\bar A = \begin{pmatrix}
0 & -1 & 0 & 0 \\
3 & 0 & -2 & 0 \\
0 & 2 & 0 & -3 \\
0 & 0 & 1 & 0
\end{pmatrix}$$

It turns out that the triple $A$, $\bar A$ and $\Lambda$ is closed under commutation, thus forms a Lie algebra, namely

$$\mathrm{span}\{A, \bar A, \Lambda\} \cong so(2,1) \cong sl(2, \mathbb{R})$$

with commutation relations

$$[A, \Lambda] = 2\bar A, \qquad [\bar A, \Lambda] = 2A, \qquad [\bar A, A] = -2\Lambda$$
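The closure under commutation can be verified directly. The sketch below assumes that the skew version is defined by $\bar A = \tfrac12[A, \Lambda]$ (an assumption on our part that reproduces the $N = 3$ matrix displayed above); the names `A_bar` and `commutator` are hypothetical.

```python
def matmul(A, B):
    """Matrix product for lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]


def commutator(X, Y):
    """[X, Y] = XY - YX."""
    XY, YX = matmul(X, Y), matmul(Y, X)
    return [[p - q for p, q in zip(row_p, row_q)] for row_p, row_q in zip(XY, YX)]


def kac_matrix(N):
    """A^(N): superdiagonal 1, ..., N and subdiagonal N, ..., 1."""
    A = [[0] * (N + 1) for _ in range(N + 1)]
    for i in range(N):
        A[i][i + 1] = i + 1
        A[i + 1][i] = N - i
    return A


def diagonal(N):
    """Lambda^(N) = diag(N, N-2, ..., -N)."""
    return [[N - 2 * k if k == l else 0 for l in range(N + 1)] for k in range(N + 1)]


def scale(c, M):
    """Scalar multiple of a matrix."""
    return [[c * x for x in row] for row in M]


def skew_version(N):
    """Assumed definition: A_bar = (1/2) [A, Lambda]; all entries of the commutator are even."""
    return [[x // 2 for x in row] for row in commutator(kac_matrix(N), diagonal(N))]


if __name__ == "__main__":
    N = 3
    A, Lam, A_bar = kac_matrix(N), diagonal(N), skew_version(N)
    print(A_bar)
    print(commutator(A_bar, Lam) == scale(2, A))
    print(commutator(A_bar, A) == scale(-2, Lam))
```

For every $N$ tried, the three commutators stay inside the span of $A$, $\bar A$ and $\Lambda$, which is what closure means here.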

5.5 Krawtchouk matrices and classical random walks 

In this section we will give a probabilistic meaning to the Krawtchouk ma- 
trices and illustrate some connections with classical random walks. 



5.5.1 Bernoulli random walk 

Let $X_i$ be independent symmetric Bernoulli random variables taking values $\pm 1$. Let $x_N = X_1 + \cdots + X_N$ be the associated random walk starting from 0. Now observe that the generating function of the elementary symmetric functions in the $X_i$ is a martingale, in fact a discrete exponential martingale:

$$M_N = \prod_{i=1}^{N} (1 + vX_i) = \sum_k v^k a_k(X_1, \ldots, X_N)$$

where $a_k$ denotes the $k$-th elementary symmetric function. The martingale property is immediate since each $X_i$ has mean 0. Refining the notation by setting $a_k^{(N)}$ to denote the $k$-th elementary symmetric function in the variables $X_1, \ldots, X_N$, multiplying $M_N$ by $1 + vX_{N+1}$ yields the recurrence

$$a_k^{(N+1)} = a_k^{(N)} + a_{k-1}^{(N)} X_{N+1}$$

which, with the boundary conditions $a_k^{(0)} = 0$ for $k > 0$, $a_0^{(n)} = 1$ for all $n \ge 0$, yields, for $k > 0$,

$$a_k^{(N+1)} = \sum_{j=0}^{N} a_{k-1}^{(j)} X_{j+1}$$

that is, these are discrete or prototypical iterated stochastic integrals and thus the simplest example of Wiener's homogeneous chaoses.

Suppose that at time $N$, the number of the $X_i$ that are equal to $-1$ is $j_N$, with the rest equal to $+1$. Then $j_N = (N - x_N)/2$ and $M_N$ can be expressed solely in terms of $N$ and $x_N$, or, equivalently, of $N$ and $j_N$:

$$M_N = (1 + v)^{N - j_N}(1 - v)^{j_N} = (1 + v)^{(N + x_N)/2}(1 - v)^{(N - x_N)/2}$$

From the generating function for the Krawtchouk matrices, equation (1.1), follows

$$M_N = \sum_i v^i K^{(N)}_{i, j_N}$$

so that as functions on the Bernoulli space, each sequence of random variables $K^{(N)}_{i, j_N}$ is a martingale.

Now we can derive two basic recurrences. From a given column of $K^{(N)}$, to get the corresponding column in $K^{(N+1)}$, we have the Pascal's triangle recurrence:

$$K^{(N+1)}_{ij} = K^{(N)}_{i-1, j} + K^{(N)}_{ij}$$

This follows in the probabilistic setting by writing $M_{N+1} = (1 + vX_{N+1})M_N$ and remarking that for $j$ to remain constant, $X_{N+1}$ must take the value $+1$. The martingale property is more interesting in the present context. We have

$$K^{(N)}_{i, j_N} = E\bigl(K^{(N+1)}_{i, j_{N+1}} \,\big|\, X_1, \ldots, X_N\bigr) = \tfrac12\Bigl(K^{(N+1)}_{i, j_N + 1} + K^{(N+1)}_{i, j_N}\Bigr)$$

since half the time $X_{N+1}$ is $-1$, increasing $j_N$ by 1, and half the time $j_N$ is unchanged. Thus, writing $j$ for $j_N$,

$$K^{(N)}_{ij} = \tfrac12\Bigl(K^{(N+1)}_{i, j+1} + K^{(N+1)}_{ij}\Bigr)$$

which may be considered as a 'reverse Pascal'.
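Both recurrences can be tested on explicit matrices. In this sketch we assume that the Pascal recurrence reads $K^{(N+1)}_{ij} = K^{(N)}_{i-1,j} + K^{(N)}_{ij}$ and that (1.1) gives $K^{(N)}_{ij}$ as the coefficient of $v^i$ in $(1+v)^{N-j}(1-v)^j$; the function names are our own.

```python
from math import comb


def krawtchouk(N):
    """Assumed form of (1.1): K[i][j] is the coefficient of v^i in (1+v)^(N-j) (1-v)^j."""
    return [[sum((-1) ** k * comb(j, k) * comb(N - j, i - k) for k in range(min(i, j) + 1))
             for j in range(N + 1)]
            for i in range(N + 1)]


def entry(K, i, j):
    """K[i][j], with zero outside the index range."""
    return K[i][j] if 0 <= i < len(K) and 0 <= j < len(K) else 0


def pascal_holds(N):
    """K^(N+1)[i][j] == K^(N)[i-1][j] + K^(N)[i][j] for columns j = 0..N."""
    K, K_next = krawtchouk(N), krawtchouk(N + 1)
    return all(K_next[i][j] == entry(K, i - 1, j) + entry(K, i, j)
               for i in range(N + 2) for j in range(N + 1))


def reverse_pascal_holds(N):
    """2 K^(N)[i][j] == K^(N+1)[i][j+1] + K^(N+1)[i][j] for 0 <= i, j <= N."""
    K, K_next = krawtchouk(N), krawtchouk(N + 1)
    return all(2 * K[i][j] == K_next[i][j + 1] + K_next[i][j]
               for i in range(N + 1) for j in range(N + 1))


if __name__ == "__main__":
    print(all(pascal_holds(N) and reverse_pascal_holds(N) for N in range(1, 8)))
```

The reverse recurrence is written with a factor 2 on the left so that the check stays in integer arithmetic.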

5.5.1.1 Orthogonality. As noted above (here with a slightly simplified notation), it is natural to use variables $(x, N)$, with $x$ denoting the position of the random walk after $N$ steps. Writing $K_\alpha(x, N)$ for the Krawtchouk polynomials in these variables, cf. equation (1.2), we have the generating function

$$G(v) = \sum_{\alpha=0}^{N} v^\alpha K_\alpha(x, N) = (1 + v)^{(N+x)/2}(1 - v)^{(N-x)/2}$$
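The expansion

$$(1 - v)^{b - t}\,\bigl(1 - (1 - R)v\bigr)^{-b} = \sum_{n=0}^{\infty} \frac{v^n}{n!}\,(t)_n\; {}_2F_1\!\left(\begin{matrix} -n,\ b \\ t \end{matrix}\;\Big|\; R\right) \qquad (5.1)$$

with $(a)_n = \Gamma(a + n)/\Gamma(a)$, yields the identification as hypergeometric functions

$$K_\alpha(x, N) = \binom{N}{\alpha}\; {}_2F_1\!\left(\begin{matrix} -\alpha,\ (x - N)/2 \\ -N \end{matrix}\;\Big|\; 2\right)$$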
The calculation

$$\langle G(v)\, G(w) \rangle = \Bigl\langle \prod_{j=1}^{N} \bigl(1 + (v + w)X_j + vwX_j^2\bigr) \Bigr\rangle = (1 + vw)^N$$

exhibits the orthogonality of the $K_\alpha$ if one observes that after taking expectations only terms in the product $vw$ remain. Thus, the $K_\alpha$ are notable for two important features:

1 They are the iterated integrals (sums) of the Bernoulli process.

2 They are orthogonal polynomials with respect to the binomial distribution.
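Orthogonality with respect to the binomial distribution can be verified exactly in integer arithmetic. The sketch below indexes the polynomial values by $j = (N - x)/2$, the number of steps equal to $-1$, which has probability $\binom{N}{j}2^{-N}$; it assumes the same form of the generating function as above, and the names are illustrative.

```python
from math import comb


def krawtchouk(N):
    """K[a][j]: coefficient of v^a in (1+v)^(N-j) (1-v)^j, i.e. K_a at x = N - 2j."""
    return [[sum((-1) ** k * comb(j, k) * comb(N - j, a - k) for k in range(min(a, j) + 1))
             for j in range(N + 1)]
            for a in range(N + 1)]


def binomial_inner_product(N, a, b):
    """2^N times the expectation of K_a K_b under the symmetric binomial distribution."""
    K = krawtchouk(N)
    return sum(comb(N, j) * K[a][j] * K[b][j] for j in range(N + 1))


if __name__ == "__main__":
    N = 4
    for a in range(N + 1):
        print([binomial_inner_product(N, a, b) for b in range(N + 1)])
```

Reading off the coefficient of $v^\alpha w^\beta$ in $(1 + vw)^N$ predicts the value $\delta_{\alpha\beta}\binom{N}{\alpha}$ for the expectation, so the scaled inner product should be $2^N\binom{N}{\alpha}$ on the diagonal and zero elsewhere.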

5.5.2 Multivariate Krawtchouk polynomials 

The probabilistic approach may be carried out for general finite probability spaces. Fix an integer $d > 0$ and $d$ values $\{\xi_0, \ldots, \xi_\delta\}$, with the convention $\delta = d - 1$. Take a sequence of independent identically distributed random variables having distribution $P(X = \xi_j) = p_j$, $0 \le j \le \delta$. Denote the mean and variance of the $X_i$ by $\mu$ and $\sigma^2$ as usual.

For $N > 0$, we have the martingale

$$M_N = \prod_{j=1}^{N} \bigl(1 + v(X_j - \mu)\bigr)$$

We now switch to the multiplicities as variables. Set

$$n_j = \sum_{k=1}^{N} \mathbf{1}_{\{X_k = \xi_j\}}$$

the number of times the value $\xi_j$ is taken. Thus the generating function

$$G(v) = \prod_{j=0}^{\delta} \bigl(1 + v(\xi_j - \mu)\bigr)^{n_j} = \sum_{\alpha=0}^{N} v^\alpha K_\alpha(n_0, \ldots, n_\delta)$$

defines our generalized Krawtchouk polynomials. One quickly gets

PROPOSITION 5 Denoting the multi-index $n = (n_0, \ldots, n_\delta)$ and by $e_j$ the standard basis on $\mathbb{Z}^{d}$, Krawtchouk polynomials satisfy the recurrence

$$K_\alpha(n + e_j) = K_\alpha(n) + (\xi_j - \mu)\,K_{\alpha-1}(n)$$

We also find by binomial expansion

PROPOSITION 6

$$K_\alpha(n_0, \ldots, n_\delta) = \sum_{|k| = \alpha}\; \prod_{j=0}^{\delta} \binom{n_j}{k_j} (\xi_j - \mu)^{k_j}$$

where $|k| = \sum_{j=0}^{\delta} k_j$.
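The explicit sum of Proposition 6 can be compared with the coefficients of the generating function, and the recurrence of Proposition 5 tested alongside. The sketch uses exact fractions; the particular values $\xi = (0, 1, 3)$ and $p = (1/2, 1/3, 1/6)$ are arbitrary choices of ours for illustration, as are the function names.

```python
from fractions import Fraction
from itertools import product
from math import comb


def krawtchouk_multi(alpha, n, xi, mu):
    """Sum over multi-indices k with |k| = alpha of prod_j C(n_j, k_j) (xi_j - mu)^k_j."""
    total = Fraction(0)
    for k in product(*(range(nj + 1) for nj in n)):
        if sum(k) != alpha:
            continue
        term = Fraction(1)
        for nj, kj, xj in zip(n, k, xi):
            term *= comb(nj, kj) * (xj - mu) ** kj
        total += term
    return total


def generating_coefficients(n, xi, mu):
    """Coefficients of v^alpha in prod_j (1 + v (xi_j - mu))^n_j."""
    coeffs = [Fraction(1)]
    for nj, xj in zip(n, xi):
        c = xj - mu
        for _ in range(nj):
            # Multiply the current polynomial by (1 + c v).
            padded = coeffs + [Fraction(0)]
            coeffs = [padded[i] + (c * padded[i - 1] if i >= 1 else 0)
                      for i in range(len(padded))]
    return coeffs


xi = (Fraction(0), Fraction(1), Fraction(3))
p = (Fraction(1, 2), Fraction(1, 3), Fraction(1, 6))
mu = sum(pj * xj for pj, xj in zip(p, xi))

if __name__ == "__main__":
    n = (2, 1, 2)
    print(generating_coefficients(n, xi, mu))
    print([krawtchouk_multi(alpha, n, xi, mu) for alpha in range(sum(n) + 1)])
```

The two printed lists should coincide, since expanding each factor $(1 + v(\xi_j - \mu))^{n_j}$ binomially and collecting powers of $v$ is exactly the sum in Proposition 6.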

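There is an interesting connection with the multivariate hypergeometric functions of Appell and Lauricella. The Lauricella polynomials $F_B$ are defined by

$$F_B(-r, b;\, t;\, s) = \sum_{k} \frac{(-r)_k\,(b)_k}{(t)_{|k|}}\, \frac{s^k}{k!}$$

with, e.g., $r = (r_1, \ldots, r_\delta)$, $(r)_k = (r_1)_{k_1}(r_2)_{k_2} \cdots (r_\delta)_{k_\delta}$ for multi-index $k$, also $s^k = s_1^{k_1} \cdots s_\delta^{k_\delta}$, and $k! = k_1! \cdots k_\delta!$. Note that $t$ is a single variable. The generating function of interest here is

$$\Bigl(1 - \sum_i u_i\Bigr)^{\sum_j b_j - t}\; \prod_{j} \Bigl(1 - \sum_i u_i + u_j s_j\Bigr)^{-b_j} = \sum_{r} \frac{u^r}{r!}\,(t)_{|r|}\, F_B(-r, b;\, t;\, s) \qquad (5.2)$$

a multivariate version of (5.1).

PROPOSITION 7 Let $N = |n|$. If $\xi_0 = 0$, then,

$$K_\alpha(n_0, \ldots, n_\delta) = (-N)_\alpha \sum_{|r| = \alpha}\; \prod_{j=1}^{\delta} \frac{(p_j \xi_j)^{r_j}}{r_j!}\; F_B\bigl(-r, -n;\, -N;\, p^{-1}\bigr)$$

where $-n = (-n_1, \ldots, -n_\delta)$ and $p^{-1} = (p_1^{-1}, \ldots, p_\delta^{-1})$.

Proof. Let $u_j = v p_j \xi_j$, $b_j = -n_j$, $t = -N$, $s_j = p_j^{-1}$ in (5.2), for $1 \le j \le \delta$. Note that $\sum_j u_j = v\mu$, $\sum_j b_j - t = N - \bigl(\sum_{1 \le j \le \delta} n_j\bigr) = n_0$. □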
i<j<<5 
Orthogonality follows similarly to the binomial case:

PROPOSITION 8 The Krawtchouk polynomials $K_\alpha(n_0, \ldots, n_\delta)$ are orthogonal with respect to the induced multinomial distribution. In fact, with $N = |n|$,

$$\langle K_\alpha K_\beta \rangle = \delta_{\alpha\beta}\, \sigma^{2\alpha} \binom{N}{\alpha}$$

Proof.

$$\begin{aligned}
\langle G(v)\, G(w) \rangle &= \sum_{n} \binom{N}{n_0, \ldots, n_\delta}\, p_0^{n_0} \cdots p_\delta^{n_\delta} \prod_{j} \bigl(1 + (v + w)(\xi_j - \mu) + vw(\xi_j - \mu)^2\bigr)^{n_j} \\
&= \Bigl(\sum_j \bigl(p_j + (v + w)\,p_j(\xi_j - \mu) + vw\,p_j(\xi_j - \mu)^2\bigr)\Bigr)^{N}
\end{aligned}$$

Thus, $\langle G(v)\, G(w) \rangle = (1 + vw\sigma^2)^N$. This shows orthogonality and yields the squared norms as well. □

5.6 "Kravchukiana" or the World of Krawtchouk Polynomials

About the year 1995, we held a seminar on Krawtchouk polynomials at 
Southern Illinois University. As we continued, we found more and more prop- 
erties and connections with various areas of mathematics. 

By the year 2000, the theory of quantum computing had been developing with serious interest in the possibility of implementation, a topic that at present attracts a great deal of attention. Sure enough, right in the middle of everything there are our flip operators, su(2), and so on: the same ingredients that make up the Krawtchouk universe. We can only report that how this all fits together is still quite open. Of special note is the idea of a hardware implementation of a Krawtchouk transform. A beginning in this direction may be found in the just-published article with Schott, Botros, and Yang [Botros et al., 2002].

At any rate, for the present we list below the topics which are central to 
our program. They are the basis of the Krawtchouk Encyclopedia, still in 
development; we are in the process of filling in the blanks. An extensive web 
resource for Krawtchouk polynomials we recommend is Zelenkov's site: 

http://www.geocities.com/orthpol/

Note that we do not mention work in areas less familiar to us, notably that relating to q-Krawtchouk polynomials, such as in [Steele, 1997].

We welcome contributions. If you wish either to send a reference to your 
paper(s) on Krawtchouk polynomials or contribute an article, please contact 
one of us ! 

Our email: pfeinsil@math.siu.edu or jkocik@math.siu.edu.



5.6.1 Krawtchouk Encyclopedia 

Here is a list of topics currently in the Krawtchouk Encyclopedia. 

1 Pascal's Triangle 

2 Random Walks 

■ Path integrals 

■ A, K, and Λ

■ Nonsymmetric Walks

■ Symmetric Krawtchouk matrices and binomial expectations

3 Urn Model 

■ Markov chains 

■ Initial and invariant distributions 

4 Symmetric Functions. Energy 

■ Elementary symmetric functions and determinants 

■ Traces on Grassmann algebras

5 Martingales 

■ Iterated integrals 

■ Orthogonal functionals 

■ Krawtchouk polynomials and multinomial distribution 

6 Lie algebras and Krawtchouk polynomials 

■ so(2,l) explained 

■ so(2,l) spinors 

■ Quaternions and Clifford algebras 

■ S and so(2,l) tensors 

■ Three-dimensional simple Lie algebras

7 Lie Groups. Reflections

■ Reflections 

■ Krawtchouk matrices as group elements 

8 Representations 

■ Splitting formula 

■ Hilbert space structure 

9 Quantum Probability and Tensor Algebra 

■ Flip operator and quantum random walk 

■ Krawtchouk matrices as eigenvectors 

■ Trace formulas. MacMahon's Theorem 

■ Chebyshev polynomials 

10 Heisenberg Algebra 

■ Representations of the Heisenberg algebra 

■ Raising and velocity operator. Number operator 

■ Evolution structure. Hamiltonian. 

■ Time-zero polynomials 

11 Central Limit Theorem 

■ Hermite polynomials 

■ Discrete stochastic differential equations 

12 Clebsch-Gordan Coefficients 

■ Clebsch-Gordan coefficients and Krawtchouk polynomials 

■ Racah coefficients 

13 Orthogonal Polynomials 

■ Three-term recurrence in terms of A, K, Lambda 

■ Nonsymmetric case 




14 Krawtchouk Transforms 

■ Orthogonal transformation associated to K 

■ Exponential function in Krawtchouk basis 

■ Krawtchouk transform 

15 Hypergeometric Functions 

■ Krawtchouk polynomials as hypergeometric functions 

■ Addition formulas 

16 Symmetric Krawtchouk Matrices 

■ The matrix T 

■ S-squared and trace formulas 

■ Spectrum of S 

17 Gaussian Quadrature 

■ Zeros of Krawtchouk polynomials 

■ Gaussian-Krawtchouk summation 

18 Coding Theory 

■ MacWilliams' theorem

■ Association schemes 

19 Appendices 

■ K and S matrices for N from 1 to 14

■ Krawtchouk polynomials in the variables x,N/i,j/j,N for N from 1 
to 20 

■ Eigenvalues of S 

■ Remarks on the multivariate case 

■ Time-zero polynomials 

■ Mikhail Philippovitch Krawtchouk: a biographical sketch

5.7 Appendix

5.7.1 Krawtchouk matrices

[Krawtchouk matrices $K^{(N)}$, beginning with $K^{(0)} = [1]$; matrix entries omitted]

Table 1

5.7.2 Symmetric Krawtchouk matrices

[Symmetric Krawtchouk matrices $S^{(N)}$, beginning with $S^{(0)} = [1]$; matrix entries omitted]

Table 2

5.7.3 Sylvester-Hadamard matrices

[Dot patterns of the Sylvester-Hadamard matrices $H^{(N)}$; patterns omitted]

Table 3

Replace • with 1 and o with -1 to obtain Sylvester-Hadamard matrices.

Abstract We introduce coupling from the past, a recently developed method for exact sampling from a given distribution. Focus is on rigour and thorough proofs. We stay on an elementary level which requires little or no prior knowledge from probability theory. This should fill an obvious gap between innumerable intuitive and incomplete reviews, and few precise derivations on an abstract level.

6.1 Introduction

We introduce a recently developed method for exact sampling from a given distribution. It is called coupling from the past. This is in contrast to Markov chain Monte Carlo samplers like the Gibbs sampler or the family of Metropolis-Hastings samplers, which return samples from a distribution approximating the target distribution. The drawback is that MCMC methods apply generally, whereas exact sampling works in special cases only. On the other hand, it is the object of current research and the list of possible applications is growing rapidly. Another advantage is that problems like burn in and convergence diagnostics do not arise where exact sampling works. Exact sampling was proposed in the seminal paper [J.G. Propp & D.B. Wilson, 1996]. Whereas these authors called the method exact sampling, some prefer the term perfect sampling since random sampling never is exact. For background in Markov chains and sampling, and for examples, we refer to [G. Winkler, 1995; G. Winkler, 2003]. The aim of the present paper is a rigorous derivation and a thorough analysis at an elementary level. Nothing is really new; the paper consists of a combination of ideas, examples, and techniques from various recent papers, basically along the lines in [F. Friedrich, 2003]. Hopefully, we can single out the basic conditions under which the method works theoretically, and what has to be added for a practicable implementation.

Coupling from the past is closely related to Markov Chain Monte Carlo sampling (MCMC), which nowadays is a widespread and commonly accepted statistical tool, especially in Bayesian statistical analysis. Hence we preface the discussion of coupling from the past with some remarks on Markov Chain Monte Carlo sampling. Let us first introduce the general framework which simultaneously gives us the basis for coupling from the past. For background and a detailed discussion see [G. Winkler, 1995].

Let $X$ be a finite set of generic elements $x, y, \ldots$. A probability distribution $\nu$ on $X$ is a function on $X$ taking values in the unit interval $[0, 1]$ such that $\sum_{x \in X} \nu(x) = 1$. A Markov kernel or transition probability on $X$ is a function $P\colon X \times X \to [0, 1]$ such that for each $x \in X$ the function $P(x, \cdot)\colon X \to [0, 1]$, $y \mapsto P(x, y)$ is a probability distribution on $X$. A probability distribution $\nu$ on $X$ can be interpreted as a row vector $(\nu(x))_{x \in X}$ and a Markov kernel $P$ as a stochastic matrix $(P(x, y))_{x, y \in X}$. A right Markov chain with initial distribution $\nu$ and transition probability $P$ is a sequence $(\xi_i)_{i \ge 0}$ of random variables the law of which is determined by $\nu$ and $P$ via the finite-dimensional marginal distributions given by

$$P(\xi_0 = x_0, \xi_1 = x_1, \ldots, \xi_n = x_n) = \nu(x_0)\,P(x_0, x_1) \cdots P(x_{n-1}, x_n).$$

$P$ is called primitive if there is a natural number $\tau$ such that $P^\tau(x, y) > 0$ for all $x, y \in X$. This means that the $\tau$-step probability from state $x$ to state $y$ is strictly positive for arbitrary $x$ and $y$. If $P$ is primitive then there is a unique probability distribution $\mu$ which is invariant w.r.t. $P$, i.e. $\mu P = \mu$ where $\mu P$ is the matrix product of the (left) row vector $\mu$ and the matrix $P$, and this invariant probability distribution $\mu$ is strictly positive.
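A small numerical illustration may help fix these notions. The sketch below is ours: the 3-state kernel `P` is an arbitrary example, `is_primitive` tests powers of `P` up to the Wielandt bound $(n-1)^2 + 1$ for an $n$-state kernel, and the invariant distribution is approximated by repeatedly multiplying a start distribution by `P`.

```python
def mat_mul(A, B):
    """Matrix product for lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]


def vec_mat(v, P):
    """Row vector times matrix: (vP)(y) = sum_x v(x) P(x, y)."""
    return [sum(v[x] * P[x][y] for x in range(len(v))) for y in range(len(P[0]))]


def is_primitive(P):
    """True if some power of P has only strictly positive entries.
    It suffices to test powers up to (n - 1)^2 + 1 (Wielandt's bound)."""
    n = len(P)
    Q = P
    for _ in range((n - 1) ** 2 + 1):
        if all(q > 0 for row in Q for q in row):
            return True
        Q = mat_mul(Q, P)
    return False


def approximate_invariant(P, steps=200):
    """Iterate nu -> nu P starting from the point mass at state 0."""
    nu = [1.0] + [0.0] * (len(P) - 1)
    for _ in range(steps):
        nu = vec_mat(nu, P)
    return nu


# Hypothetical example kernel on three states.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]

if __name__ == "__main__":
    mu = approximate_invariant(P)
    print(is_primitive(P), mu, vec_mat(mu, P))
```

For this kernel the limit is $\mu = (1/4, 1/2, 1/4)$, which is strictly positive and satisfies $\mu P = \mu$; a periodic kernel such as the two-state swap fails the primitivity test.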

The laws or distributions of the variables £ n of such a process converge to 
the invariant distribution, i.e. 

vP P(y)— */i(y), yeX, (1.1) 

cf. [G. WINKLER, 1995], Theorem 4.3.1. Perhaps the most important statis- 
tical features to be estimated are expectation values of functions on the state 
space X, and the most common estimators are empirical means. Fortunately, 
such stochastic processes fulfill the law of large numbers, which in its most el- 
ementary version reads: For each function / on X, the empirical means along 
time converge in probability (and in L 2 ) to the expectation of/ with respect to 
the invariant distribution; in formulae this reads 



1 n— 1 

— 7 /(£;) — ► E (/; /j.) as n — ► oo, in probability, (1.2) 



i=0 



(cf. [G. WINKLER, 1995], Theorem 4.3.2). The symbol E(/;/i) denotes the 
expectation 



E(/;/i) = X^C*M*) 



xex 

of / with respect to yu. A sequence of random variables & converges to the 
random variable $\xi$ in probability if for each $\varepsilon > 0$ the probability
$P(|\xi_n - \xi| > \varepsilon)$ tends to $0$ as $n$ tends to $\infty$. Plainly, (1.2) implies that for
every natural number $m$, averaging may be started from $m$ without destroying convergence
in probability; more precisely, for each $m \ge 0$ one has

$$\frac{1}{n-m} \sum_{i=m+1}^{n} f(\xi_i) \longrightarrow E(f;\mu) \quad \text{as } n \to \infty, \text{ in probability.} \tag{1.3}$$

In view of the law of large numbers for identically distributed and independent
variables, the step number $m$ should be large enough such that the distributions
of the variables $\xi_{m+1}, \ldots, \xi_n$ are close to the invariant distribution $\mu$ in order
to estimate the expectation of $f$ with respect to $\mu$ properly from the samples
$f(\xi_{m+1}), \ldots, f(\xi_n)$.

In fact, according to (1.1), after some time $m$ the laws of the $\xi_i$ should be
close to the invariant distribution $\mu$, although they may be far from $\mu$ during
the initial period. The values during this burn-in period are usually discarded
and an average $\bigl(\sum_{i=m+1}^{n} f(\xi_i)\bigr)/(n-m)$ like in (1.3) is computed. In general,
the burn-in time can hardly be determined. There are a lot of suggestions, ranging
from visual inspection of the time series $(f(\xi_i))_{i \ge 0}$ to more formal tools,
called convergence diagnostics. In this text we are not concerned with burn-in
and restrict ourselves to the illustration in Fig. 1. A Gibbs sampler (introduced
in Section 6.4) for the Ising model is started with a pepper and salt configuration
in the left picture. A typical sample of the invariant distribution is the right one,
which appears after about 8000 steps. The pictures in between show intermediate
configurations which are pretty improbable given the invariant distribution but
which are quite stable with respect to the Gibbs sampler. In physical terms, the
right middle configuration is close to a 'meta-stable' state. Since we are interested
in a typical configuration of the invariant distribution $\mu$, we should consider the
burn-in to be completed if the sample from the Markov chain looks like the right
hand side of Fig. 1, i.e. after about 8000 steps of the Gibbs sampler.

Figure 1. Configurations for Ising Gibbs Sampler with $\beta = 0.8$ starting in a pepper and salt
configuration (left), after 150 steps (left middle), after 350 steps (right middle) and after 8000
steps (right).

The curve in the next figure, Fig. 2, displays the relative frequency of equal neighbour
pairs. Superficial visual inspection of this plot suggests that the sampler should be in
equilibrium after about 300 steps. On the other hand, comparison with Fig. 1 reveals
that the slight ascent at about 7800 steps presumably is much more relevant
for the decision whether burn-in is completed or not. This indicates that primitive
diagnostic tools may be misleading. The interested reader is referred to the
references in [W.R. GILKS ET AL., 1996; A. GELMAN, 1996; A.E. RAFTERY
& S.M. LEWIS, 1996], see [W.R. GILKS ET AL., 1996b]. If initial samples
from $\mu$ itself are available, then there is no need for a burn-in, and one can
average from the beginning. This is one of the most valuable advantages of
exact sampling.
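
The averaging after a burn-in period, as in (1.3), can be sketched in a few lines of Python. This is only an illustration: the two-state kernel `P`, the function `f`, the seed, the burn-in length and the helper name `ergodic_average` are assumptions made for the example and do not come from the text.

```python
import random

def ergodic_average(P, f, x0, n, m, rng):
    """Average f over steps m+1..n of a chain with kernel P, discarding the burn-in."""
    x, total = x0, 0.0
    for i in range(1, n + 1):
        # Draw the next state from row P[x] of the (hypothetical) kernel.
        x = rng.choices(range(len(P)), weights=P[x])[0]
        if i > m:
            total += f(x)
    return total / (n - m)

# Hypothetical two-state kernel; its invariant distribution is (0.75, 0.25).
P = [[0.9, 0.1], [0.3, 0.7]]
rng = random.Random(0)
estimate = ergodic_average(P, lambda x: float(x == 1), x0=1, n=200_000, m=1_000, rng=rng)
print(estimate)  # should be close to mu(1) = 0.25
```

Choosing `m` here is arbitrary; as stressed above, a suitable burn-in time can hardly be determined in general.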

First, we indicate how a Markov chain can be simulated.

Example 1 (Simulating a Markov chain) We denote by $P$ the transition probability
of a homogeneous Markov chain. At each time $n \ge 1$, given the previous state
$x_{n-1}$, we want to pick a state $x_n$ at random from $P(x_{n-1}, \cdot)$. For each $x$, we
partition the unit interval $(0,1]$ into intervals $I_y^x$ of length $P(x,y)$, and pick $u_n$
uniformly at random from $(0,1]$. Given the present state $x_{n-1}$, we search for the
state $y$ with $u_n \in I_y^{x_{n-1}}$ and set $x_n = y$. The accompanying picture illustrates this
procedure for $|X| = 3$, where $x_n = y_1$ if $x_{n-1}$ was $y_1$ or $y_2$ and $x_n = y_3$ if
$x_{n-1} = y_3$. In general, the procedure can be rephrased as follows: Define a
transition rule for $P$ by

$$f : X \times (0,1] \longrightarrow X, \qquad f(x,u) = y \ \text{ if and only if } \ u \in I_y^x.$$

More explicitly, enumerate $X = \{y_1, \ldots, y_N\}$ and set $f(x,u) = F_x^-(u)$
where $F_x(t) = P(x, \{y_i : i \le t\})$ is the cumulative distribution function
of $P(x, \cdot)$ and $F_x^-(u) = \min\{t : F_x(t) \ge u\}$ its generalized inverse. Let
$U_1, U_2, \ldots$ be independent random variables uniformly distributed over $(0,1]$,
and set $\xi_0 := x_0$, and $\xi_n := f(\xi_{n-1}, U_n)$. Then $(\xi_n)_{n \ge 0}$ is a homogeneous
Markov chain starting at $x_0$ with transition probability $P$. For inhomogeneous
chains, replace $f$ by $f_n$ varying in time. Note that the exclusive source of
randomness are the independent random variables $U_i$.
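
A minimal Python sketch of the transition rule from Example 1 follows. It assumes that states are encoded as the indices 0, ..., N-1 and that the kernel is given as a list of rows; the kernel `P`, the seed and the function names are illustrative assumptions.

```python
import random
from itertools import accumulate

def transition_rule(P, x, u):
    """f(x, u): smallest index t with F_x(t) >= u, where F_x is the CDF of row P[x]."""
    for t, cum in enumerate(accumulate(P[x])):
        if u <= cum:
            return t
    return len(P[x]) - 1  # guard against round-off in the last cumulative sum

def simulate_chain(P, x0, n, rng):
    """Return the path xi_0, ..., xi_n with xi_k = f(xi_{k-1}, U_k)."""
    path = [x0]
    for _ in range(n):
        u = 1.0 - rng.random()  # uniform on (0, 1], since random() lies in [0, 1)
        path.append(transition_rule(P, path[-1], u))
    return path

# Hypothetical kernel on three states.
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.0, 0.4, 0.6]]
path = simulate_chain(P, x0=0, n=10, rng=random.Random(1))
print(path)
```

All randomness enters through the uniform numbers `u`; the function `transition_rule` itself is deterministic, which is what later allows several chains to be driven by the same random numbers.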
Figure 2. Convergence Diagnostics for Ising Gibbs Sampler

6.2 Exact Sampling

The basic idea of coupling from the past is closely related to the law of large
numbers (1.2). According to (1.1), for primitive $P$ with invariant distribution
$\mu$ the corresponding Markov chain converges to $\mu$; more precisely

$$\nu P^n \longrightarrow \mu \quad \text{as } n \to \infty, \tag{2.1}$$

uniformly in all initial distributions $\nu$, and with respect to any norm on $\mathbb{R}^X$.

Generalizing the concept of right Markov chains, let us consider now two-sided
Markov chains with transition probabilities given by a Markov kernel $P$,
i.e. double sequences $(\xi_i)_{i \in \mathbb{Z}}$ of random variables taking values in $X$, and with
law determined by the marginal distributions

$$\mathbb{P}(\xi_m = x_m, \ldots, \xi_n = x_n) = \nu_m(x_m) P(x_m, x_{m+1}) \cdots P(x_{n-1}, x_n), \tag{2.2}$$

for $m, n \in \mathbb{Z}$, $n > m$, where $\nu_k$ denotes the law of $\xi_k$.

If $P$ is primitive, or more generally, if (2.1) holds uniformly, these two-sided
chains are automatically stationary. This important concept means that a time
shift does not change the law of the chain; in terms of the marginal distributions
this reads

$$\mathbb{P}(\xi_m = x_m, \ldots, \xi_n = x_n) = \mathbb{P}(\xi_{m+\tau} = x_m, \ldots, \xi_{n+\tau} = x_n) \tag{2.3}$$

for all $m \in \mathbb{Z}$ and $\tau \in \mathbb{Z}$, and in particular, that all $\nu_m$ in (2.2) are equal to $\mu$.
In fact, because of (2.2) one has $\nu_0 = \nu_{-k} P^k$ for all $k \in \mathbb{N}$. By uniformity
in (2.1), this implies $\nu_0 = \mu$ and hence in view of (2.2) the process $(\xi_i)_{i \in \mathbb{Z}}$ is
stationary.

At first glance, this does not seem to be helpful since we cannot simulate
the two-sided chain starting at time $-\infty$. On the other hand, if we want to start
sampling at some (large negative) time $n$, there is no distinguished state to start
in, since stationarity of the chain implies that the initial state necessarily is
already distributed according to $\mu$. The main idea to overcome this problem is to
start chains simultaneously at all states in $X$ and at each time. This means that
a lot of Markov chains are coupled together. The coupling will be constructed
in such a fashion that if two of the chains happen to be in the same state in
$X$ at some (random) time, they will afterwards follow the same trajectory forever.
This phenomenon is called coalescence of trajectories. Our definite aim
is to couple the chains in a cooperative way such that after a large time it is
very likely that any two of the chains have met each other at time 0. Then,
at time 0, all chains started simultaneously at sufficiently large negative time
have coalesced, and therefore their common state at time 0 does not depend
on the starting points in the far past anymore. We will show that after complete
coalescence the unique random state at time 0 is distributed according to the
invariant distribution $\mu$.

To make this precise we consider the following setup: Let $X$ be a finite
space and let $\mu$ be a strictly positive probability distribution on $X$. The aim is
to realize a random variable which exactly has law $\mu$, or, in other words, to
sample from $\mu$. Since Markov chains have to be started at each time $k < 0$
and at each state $x \in X$ simultaneously, a formal framework is needed into
which all these processes can be embedded. The appropriate concept is that of
iterated random maps or stochastic flows, systematically exploited in
[P. Diaconis & D. Freedman, 1999].

Let $\mu$ be the strictly positive distribution on $X$ from which we want to sample
and let $P$ be a Markov kernel on $X$ for which $\mu$ is the unique invariant
distribution. Let $\Phi$ be the set of all maps from $X$ to itself:

$$\Phi = \{\varphi : X \to X\} = X^X = \mathrm{Map}(X, X).$$

On this space we consider distributions $p$ reflecting the action of $P$ on $X$ in
the sense that the $p$-probability that some point $x$ is mapped by the random
function $\varphi$ to some $y$ is given by $P(x,y)$. This connection between $p$ and $P$ is
formalized by the condition

(P) $p(\{\varphi : \varphi(x) = y\}) = P(x,y)$, $x, y \in X$.

EXAMPLE 2 Such a distribution does always exist. A synchronous one is
given by $q(\varphi) = \prod_{x \in X} P(x, \varphi(x))$. It is a probability distribution since it can
be written as a product of the distributions $P(x, \cdot)$. It also fulfills Condition
(P): Let $\Phi'$ be the set of all maps from $X \setminus \{x\}$ to $X$. Then

$$q(\varphi : \varphi(x) = y) = \sum_{\varphi:\,\varphi(x) = y} \prod_{z \in X} P(z, \varphi(z)) = P(x,y) \sum_{\psi \in \Phi'} \prod_{z \ne x} P(z, \psi(z)) = P(x,y);$$

the sum over $\Phi'$ equals 1 since the summands again define a product measure.
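
For a small state space the claim of Example 2 can be checked by brute force. The following Python sketch enumerates all maps from $X$ to $X$, assigns each the product weight $q$, and verifies Condition (P) exactly with rational arithmetic. The kernel `P` and the function names are assumptions chosen for illustration; maps are encoded as tuples whose entry `phi[x]` is the image of `x`.

```python
from itertools import product
from fractions import Fraction

def synchronous_distribution(P):
    """q(phi) = prod_x P(x, phi(x)) for every map phi: X -> X, encoded as a tuple."""
    n = len(P)
    q = {}
    for phi in product(range(n), repeat=n):  # phi[x] is the image of x
        weight = Fraction(1)
        for x in range(n):
            weight *= P[x][phi[x]]
        q[phi] = weight
    return q

def check_condition_P(P, q):
    """Check q({phi : phi(x) = y}) == P(x, y) for all x, y."""
    n = len(P)
    return all(
        sum(w for phi, w in q.items() if phi[x] == y) == P[x][y]
        for x in range(n) for y in range(n)
    )

half, third = Fraction(1, 2), Fraction(1, 3)
# Hypothetical kernel on three states.
P = [[half, half, 0], [third, third, third], [0, half, half]]
q = synchronous_distribution(P)
print(sum(q.values()), check_condition_P(P, q))  # total mass 1 and Condition (P) holds
```

The enumeration has $|X|^{|X|}$ terms, so it only serves as a check on tiny examples.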

Since we want to mimic Markov processes, we need measures on sets of paths,
and since we will proceed from time $-\infty$ to finite times we introduce measures
on the set $\Omega = \Phi^{\mathbb{Z}}$ with one-dimensional marginal measures $p$. The simplest
choice are product measures $\mathbb{P} = p^{\mathbb{Z}}$. The space $\Omega = \Phi^{\mathbb{Z}}$ consists of double
sequences

$$\underline{\varphi} = (\varphi_j)_{j \in \mathbb{Z}} = (\ldots, \varphi_{-1}, \varphi_0, \varphi_1, \ldots) \in \mathrm{Map}(X, X)^{\mathbb{Z}}.$$

If $J$ is a finite subset of $\mathbb{Z}$ then for each choice $\psi_j$, $j \in J$, we have

$$\mathbb{P}(\{\underline{\varphi} \in \Omega : \varphi_j = \psi_j,\ j \in J\}) = \prod_{j \in J} p(\psi_j).$$

Given a double sequence $\underline{\varphi}$ of maps $\varphi_j$, $j \in \mathbb{Z}$, we consider compositions of
the components $\varphi_j$ over time intervals. For each $\underline{\varphi} \in \Omega$ and $x \in X$, set

$$\varphi_j^k(x) = \varphi_k \circ \cdots \circ \varphi_j(x) = \varphi_k(\varphi_{k-1}(\cdots(\varphi_j(x)))), \quad j \le k.$$

Note that $\varphi_i^i = \varphi_i$.

REMARK Given Condition (P), for each $n \in \mathbb{Z}$ and $x \in X$, the process
$\xi_n = x$, $\xi_{n+k} = \varphi_{n+1}^{n+k}(x)$, $k \ge 1$, is a Markov chain starting at $x$ and with
transition probability $P$. Hence the stochastic flow is a common representation
of Markov chains starting at all initial states and at all times; we shall say that
they are coupled from the past.

Coupling from the past at time $n$ will work as follows: Pick a double sequence

$$\ldots, \varphi_m, \ldots, \varphi_n, \ldots$$

of maps at random, and fix a number $n \in \mathbb{Z}$. Then decrease $m$ until
$\varphi_m^n(x) = w$ hopefully does not depend on $x$ anymore. If we are successful
and this happens then we say that all trajectories

$$\varphi_m(x),\ \varphi_{m+1} \circ \varphi_m(x),\ \ldots,\ \varphi_m^n(x), \quad x \in X,$$

have coalesced. We shall also say that for $\underline{\varphi}$ there is complete coalescence
at time $n$. This works if sufficiently many of the $\varphi_j$ map different elements
$x$ to the same image. Going further backwards does not change anything
since $\varphi_{m-k}^n(x) = \varphi_m^n(\varphi_{m-k}^{m-1}(x)) = w$ holds as well for all $x$. This may
be rephrased in terms of sets as follows: Let $\varphi : X \to X$ be a map and
$\mathrm{Im}\,\varphi = \{\varphi(x) : x \in X\}$ the image of $X$ under $\varphi$. For fixed $n$ the sets $\mathrm{Im}\,\varphi_m^n$
decrease as $m$ decreases. Complete coalescence means that $\mathrm{Im}\,\varphi_m^n$ is a singleton
$\{w\}$. Then there is a unique $W_n(\underline{\varphi}) \in X$ with

$$\{W_n(\underline{\varphi})\} := \bigcap_{m < n} \mathrm{Im}\,\varphi_m^n. \tag{2.4}$$

If there is no coalescence then $W_n(\underline{\varphi})$ is not defined. Let us set

$$F_n = \{\underline{\varphi} : W_n(\underline{\varphi}) \text{ exists}\}, \qquad F = \bigcap_{n \in \mathbb{Z}} F_n.$$

Then all $W_n$ are well defined on $F$; to complete the definition let $W_n(\underline{\varphi}) = z_0$
for some fixed $z_0 \in X$ if $\underline{\varphi} \notin F$. Obviously, independent of the choice of
$x \in X$,

$$W_{n+k}(\underline{\varphi}) = \lim_{m \to -\infty} \varphi_m^{n+k}(x) = \varphi_{n+1}^{n+k} \circ \lim_{m \to -\infty} \varphi_m^n(x) = \varphi_{n+1}^{n+k}(W_n(\underline{\varphi})), \quad \underline{\varphi} \in F,\ n \in \mathbb{Z},\ k > 0. \tag{2.5}$$

This indicates that the random variables $W_n$ have law $\mu$. To exploit this
observation for a sampling algorithm we need almost sure complete coalescence
in finite time. We enforce this by the formal condition

(F) $\mathbb{P}(F) = 1$.

Provided that (F) holds, we call $\mathbb{P}$ successful. Condition (F) will be verified
below under natural conditions.

LEMMA 3 Under the hypotheses (P) and (F) the process $(W_m)_{m \in \mathbb{Z}}$ is a
stationary homogeneous Markov process with Markov kernel $P$.

Proof. Recall that $\mathbb{P}$ is a homogeneous product measure, and hence for each
$\tau \in \mathbb{Z}$ all random sequences $(\varphi_m, \ldots, \varphi_{m+\tau})$, $m \in \mathbb{Z}$, have the same law.
Hence the stochastic flow is stationary, and the process $(W_m)_{m \in \mathbb{Z}}$ is stationary
as well. Moreover, $\varphi_{n+1}^{n+k}$ depends on $\varphi_{n+1}, \ldots, \varphi_{n+k}$ only and each $W_m$
depends only on $\ldots, \varphi_{m-1}, \varphi_m$. Again, since $\mathbb{P} = p^{\mathbb{Z}}$ is a product measure, the
variables $\varphi_{n+1}^{n+k}$ and $W_m$, $m \le n$, are independent. By (2.4) and (P),

$$\begin{aligned}
\mathbb{P}(W_{n+1} = x_{n+1}, W_n = x_n, \ldots, W_{n-k} = x_{n-k})
&= \mathbb{P}(\varphi_{n+1}^{n+1}(x_n) = x_{n+1}, W_n = x_n, \ldots, W_{n-k} = x_{n-k}) \\
&= \mathbb{P}(\varphi_{n+1}(x_n) = x_{n+1})\, \mathbb{P}(W_n = x_n, \ldots, W_{n-k} = x_{n-k}) \\
&= P(x_n, x_{n+1})\, \mathbb{P}(W_n = x_n, \ldots, W_{n-k} = x_{n-k}),
\end{aligned}$$

which shows

$$\mathbb{P}(W_{n+1} = x_{n+1} \mid W_n = x_n, \ldots, W_{n-k} = x_{n-k}) = P(x_n, x_{n+1}).$$

Hence $P$ is the transition probability of the process $(W_m)_{m \in \mathbb{Z}}$. ■

Let us put things together in the first main theorem.

THEOREM 4 (EXACT SAMPLING) Suppose that $\mu$ is a strictly positive probability
distribution and $P$ a primitive Markov kernel on $X$ such that $\mu P = \mu$.
Assume further that $p(\{\varphi : \varphi(x) = y\}) = P(x,y)$ for all $x, y \in X$, and that
$\mathbb{P}$ is successful. Then each random variable $W_n$ has law $\mu$; more precisely:

$$\mathbb{P}(\{\underline{\varphi} \in \Omega : W_n(\underline{\varphi}) = x\}) = \mu(x), \quad x \in X. \tag{2.6}$$

Proof. By stationarity from Lemma 3, all one-dimensional marginal distributions
coincide, and $P$ is the transition probability of $(W_n)_{n \in \mathbb{Z}}$. If $P$ is primitive
then by [G. WINKLER, 1995], Theorem 4.3.1, its unique invariant distribution
is $\mu$. ■

To sample from $\mu$, only one of the $W_m$ is needed.

COROLLARY 5 Under the assumptions of Theorem 4, the random variable
$W_0$ has law $\mu$.

The next natural question concerns the waiting time for complete coalescence
at time zero. The random times $T_n$ of latest coalescence before $n$ are
given by

$$T_n(\underline{\varphi}) = \sup\{m < n : \text{there is } w \in X \text{ such that } \varphi_m^n(x) = w \text{ for every } x \in X\}.$$

The numbers $T_n(\underline{\varphi})$ definitely are finite if $\underline{\varphi} \in F$; outside $F$ they may be finite
or equal $-\infty$. Condition (F) is equivalent to

$$\mathbb{P}(\{\underline{\varphi} \in \Omega : T_n(\underline{\varphi}) > -\infty\}) = 1 \quad \text{for every } n \in \mathbb{Z}. \tag{2.7}$$

Such a random time is also called successful. To realize $W_0$ one subsequently
and independently picks maps $\varphi_0, \varphi_{-1}, \ldots, \varphi_m$ until there is coalescence, say
in $w \in X$. This element $w$ is a sample from $\mu$. For computational reasons,
one usually goes back in time by powers of 2. Clearly, choosing $k_0(\underline{\varphi})$ such
that $-2^{k_0(\underline{\varphi})} \le T_0(\underline{\varphi})$ assures coalescence at time 0. Recall that such a $k_0(\underline{\varphi})$
exists for each $\underline{\varphi} \in F$. An example of a stochastic flow coalescing completely
at time $m = 0$ is shown in Fig. 3.

Figure 3. Latest complete coalescence time before time 0

We are going now to discuss a condition for (F) to hold. Pairwise coalescence
with positive probability is perhaps the most natural condition and easy to check:

(C) For each pair $x, y \in X$ there is an integer $n(x,y)$ such that

$$p^{n(x,y)}\bigl(\{(\varphi_1, \ldots, \varphi_{n(x,y)}) \in \Phi^{n(x,y)} : \varphi_1^{n(x,y)}(x) = \varphi_1^{n(x,y)}(y)\}\bigr) > 0.$$

We shall show in Theorem 9 below that (C) and (F) are equivalent. We give
now a simple example where coupling fails.

EXAMPLE 6 Consider $P$ with invariant distribution $\mu$ on $X = \{1, 2\}$ given by

$$P = \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}, \qquad \mu = (1/2, 1/2).$$

Let $p(\iota) = 1/2 = p(\psi)$ for the identity map $\iota(1) = 1$, $\iota(2) = 2$, and the flip
map $\psi(1) = 2$, $\psi(2) = 1$. Compositions of $\iota$ and $\psi$ never will couple. On
the other hand the flow is associated to $P$ since $p(\{\varphi : \varphi(x) = y\}) = 1/2 =
P(x,y)$, regardless of $x$ and $y$, and Condition (P) holds.
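
The failure of coupling in Example 6 is easy to observe numerically. In the following Python sketch the two states are encoded as 0 and 1, maps are tuples, and the helper names are illustrative assumptions. Since the identity and the flip are both bijections, every composition is a bijection and the image never shrinks.

```python
import random

IDENTITY = (0, 1)   # iota: 1 -> 1, 2 -> 2 (states encoded as 0 and 1)
FLIP = (1, 0)       # psi:  1 -> 2, 2 -> 1

def compose(outer, inner):
    """Return the map x -> outer(inner(x)), with maps encoded as tuples."""
    return tuple(outer[inner[x]] for x in range(len(inner)))

def image_sizes(n_steps, rng):
    """Track |Im(phi_m^0)| while going back in time m = 0, -1, ..., -(n_steps - 1)."""
    flow = IDENTITY  # empty composition
    sizes = []
    for _ in range(n_steps):
        phi_m = rng.choice([IDENTITY, FLIP])  # p(iota) = p(psi) = 1/2
        flow = compose(flow, phi_m)           # phi_m^0 = phi_{m+1}^0 o phi_m
        sizes.append(len(set(flow)))
    return sizes

sizes = image_sizes(1000, random.Random(0))
print(set(sizes))  # {2}: the two trajectories never meet
```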

We shall show now that the coupling condition (C) implies complete coalescence
(F) (and the converse). The latter condition may be rephrased as follows:
All random times $T_n$ are finite almost surely. By stationarity this boils down
to: The random time $T_0$ is finite almost surely. The simplest, but fairly abstract
way to verify (F) is to use shift invariance of $F$ and ergodicity of $\mathbb{P}$. We will
argue along these lines but in a more explicit and elementary way. The first
step is to ensure existence of a finite $\tau$ such that the flow coalesces completely
in less than $\tau$ steps with positive probability.

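LEMMA 7 Under condition (C) there is a natural number $\tau$ such that

$$\mathbb{P}(\{\underline{\varphi} : T_0(\underline{\varphi}) \ge -\tau\}) > 0.$$

Proof. Let $n_c = \max\{n(x,y) : x, y \in X\}$. If $\varphi_1^n(x) = \varphi_1^n(y)$ for some
$n \le n_c$ then $\varphi_1^{n_c}(x) = \varphi_{n+1}^{n_c} \circ \varphi_1^n(x) = \varphi_1^{n_c}(y)$ as well. Hence Condition (C)
implies

$$q = \min\bigl\{p^{n_c}(\{(\varphi_1, \ldots, \varphi_{n_c}) : \varphi_1^{n_c}(x) = \varphi_1^{n_c}(y)\}) : x, y \in X\bigr\} > 0.$$

Therefore $|X| > |\mathrm{Im}\,\varphi_1^{n_c}|$ at least with probability $q > 0$ if $|X| \ge 2$. Similarly,
$|\mathrm{Im}\,\varphi_1^{n_c}| > |\mathrm{Im}\,\varphi_1^{2n_c}|$ with probability at least $q^2$ if the left set is no singleton.
This holds because $\varphi_1^{2n_c} = \varphi_{n_c+1}^{2n_c} \circ \varphi_1^{n_c}$ and the variables $\varphi_1, \ldots, \varphi_{n_c}$ and
$\varphi_{n_c+1}, \ldots, \varphi_{2n_c}$ are independent and identically distributed. By induction,

$$|X| > |\mathrm{Im}\,\varphi_1^{n_c}| > |\mathrm{Im}\,\varphi_1^{2n_c}| > \cdots > |\mathrm{Im}\,\varphi_1^{kn_c}|$$

at least with probability $q^k$ until the last cardinality becomes 1; this happens
after at most $|X| - 1$ steps. Let $\tau = (|X| - 1) n_c - 1$. Nothing changes if we
renumber the maps as $\varphi_{-\tau}, \ldots, \varphi_0$. Hence $\mathbb{P}(\{|\mathrm{Im}\,\varphi_{-\tau}^0| = 1\}) \ge q^{\tau}$
and the lemma is proved. ■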

The next step is a sub-multiplicativity property of probabilities for coalescence
times.

LEMMA 8 Let $n, m < 0$ be negative integers. Then

$$\mathbb{P}(T_0 < m + n) \le \mathbb{P}(T_0 < m)\, \mathbb{P}(T_m < m + n) = \mathbb{P}(T_0 < m)\, \mathbb{P}(T_0 < n).$$

Proof. Suppose that $T_0(\underline{\varphi}) < m + n$. This holds if and only if $\mathrm{Im}\,\varphi_{m+n+1}^0$ has
more than one element. Then both $\mathrm{Im}\,\varphi_{m+1}^0$ and $\mathrm{Im}\,\varphi_{m+n+1}^m$ have more than
one element. Hence

$$\mathbb{P}(\underline{\varphi} : T_0(\underline{\varphi}) < m + n) \le \mathbb{P}(\underline{\varphi} : T_0(\underline{\varphi}) < m \text{ and } T_m(\underline{\varphi}) < m + n).$$

To check whether $T_0(\underline{\varphi}) < m$ holds true it is sufficient to know the maps
$\varphi_{m+1}, \ldots, \varphi_0$, and similarly, to check $T_m(\underline{\varphi}) < n + m$ only $\varphi_{m+n+1}, \ldots, \varphi_m$
are needed. Hence the respective sets are independent and the inequality holds.
The remaining identity follows from stationarity. ■

In combination with Theorem 4, the next result completes the derivation of exact sampling.

Theorem 9 The Conditions (F) and (C) are equivalent. In particular, the
process governed by $\mathbb{P}$ is successful under (C), and almost sure coalescence in
Theorem 4 is assured.

Proof. Suppose that (C) holds. By Lemma 7, we have $\mathbb{P}(T_0 \ge -\tau) > 0$ and
Lemma 8 implies

$$\mathbb{P}(T_0 < -n\tau) \le \mathbb{P}(T_0 < -\tau)^n = (1 - \mathbb{P}(T_0 \ge -\tau))^n \longrightarrow 0 \quad \text{as } n \to \infty.$$

By stationarity, this implies (F). Conversely, suppose that (F) holds, i.e. that
$\mathbb{P}(F) = 1$. Since $F$ is the intersection of the sets

$$F_n = \{\underline{\varphi} : \text{there is } m < n \text{ such that } |\mathrm{Im}\,\varphi_m^n| = 1\},$$

each of these sets has full measure 1 as well. Fix $n$ now. Plainly, the sets

$$F_n^m = \{\underline{\varphi} : |\mathrm{Im}\,\varphi_m^n| = 1\}$$

increase to $F_n$ as $m$ decreases to $-\infty$. Hence there is $m < n$ such that
$\mathbb{P}(F_n^m) > 0$. Choose now $x \ne y$ in $X$. Since $\varphi_1^{n-m+1}$ and $\varphi_m^n$ are equal
in law, for $\tau = n - m + 1$ one has

$$p^{\tau}(\{(\varphi_1, \ldots, \varphi_\tau) : \varphi_1^\tau(x) = \varphi_1^\tau(y)\}) = \mathbb{P}(\underline{\varphi} : \varphi_m^n(x) = \varphi_m^n(y)) \ge \mathbb{P}(F_n^m) > 0,$$

and (C) holds. ■

This shows that any derivation of coupling from the past
which does not explicitly or implicitly use a hypothesis like (C) or a suitable
substitute is necessarily incomplete or incorrect.

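Remark It is tempting to transfer the same idea to 'coupling to the future'.
Unfortunately, starting at zero and returning the first state of complete coalescence
after zero, in general does not give a sample from $\mu$.

The reader may want to check the following simple example from
[F. FRIEDRICH, 2003].

Example 10 Let $X = \{1, 2\}$. Positive transition probabilities $P$ and their
invariant distributions $\mu$ have the form

$$P := \begin{pmatrix} 1-\lambda & \lambda \\ \kappa & 1-\kappa \end{pmatrix}, \quad 0 < \lambda, \kappa < 1, \qquad \mu = \left(\frac{\kappa}{\lambda+\kappa}, \frac{\lambda}{\lambda+\kappa}\right).$$

Start two independent chains $\eta$ and $\xi$ with transition probability $P$ at time 0
from 1 and 2, respectively. The time of first coalescence in the future is

$$T := \min\{m \in \mathbb{N} : \eta_m = \xi_m\}.$$

Denote the common law of $\eta_T$ and $\xi_T$ by $\varrho$. We will shortly verify that $\varrho = \mu$
if and only if $\lambda = \kappa$. Compute first

$$\begin{aligned}
\mathbb{P}(\eta_{n+1} = \xi_{n+1} = 1,\ \eta_m \ne \xi_m,\ m \le n)
&= \kappa(1-\lambda) \sum_{k=0}^{n} \binom{n}{k} ((1-\lambda)(1-\kappa))^k (\lambda\kappa)^{n-k} \\
&= \kappa(1-\lambda)((1-\lambda)(1-\kappa) + \lambda\kappa)^n = \kappa(1-\lambda)(1 - (\lambda + \kappa - 2\lambda\kappa))^n
\end{aligned}$$

and

$$\varrho(1) = \kappa(1-\lambda) \sum_{n=0}^{\infty} (1 - (\lambda + \kappa - 2\lambda\kappa))^n = \frac{\kappa(1-\lambda)}{\kappa(1-\lambda) + \lambda(1-\kappa)}.$$

Hence

$$\varrho = \left(\frac{\kappa(1-\lambda)}{\kappa(1-\lambda) + \lambda(1-\kappa)},\ \frac{\lambda(1-\kappa)}{\kappa(1-\lambda) + \lambda(1-\kappa)}\right).$$

This is the invariant distribution $\mu$ if and only if $\lambda = \kappa$.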
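
The bias of 'coupling to the future' in Example 10 can be checked numerically. The Python sketch below compares the closed form for the law of the first common state with the invariant distribution and with a Monte Carlo estimate. The parameter values, the seed, the sample size and all function names are assumptions chosen for illustration.

```python
import random

def forward_coalescence_law(lam, kap):
    """Closed form for the law of the first common state, as derived in Example 10."""
    denom = kap * (1 - lam) + lam * (1 - kap)
    return (kap * (1 - lam) / denom, lam * (1 - kap) / denom)

def invariant_law(lam, kap):
    """Invariant distribution of P = [[1 - lam, lam], [kap, 1 - kap]]."""
    return (kap / (lam + kap), lam / (lam + kap))

def simulate_first_meeting(lam, kap, rng):
    """Run two independent chains from states 1 and 2 until they first meet."""
    def step(x):
        if x == 1:
            return 2 if rng.random() < lam else 1
        return 1 if rng.random() < kap else 2
    eta, xi = 1, 2
    while eta != xi:
        eta, xi = step(eta), step(xi)
    return eta

lam, kap = 0.2, 0.6
rng = random.Random(0)
n = 100_000
freq_1 = sum(simulate_first_meeting(lam, kap, rng) == 1 for _ in range(n)) / n
print(forward_coalescence_law(lam, kap)[0], invariant_law(lam, kap)[0], freq_1)
```

For these parameters the closed form gives about 0.857 for state 1, whereas the invariant distribution assigns 0.75 to it; the simulated frequency should follow the former.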

The representation of Markov chains by stochastic flows is closely connected
to the actual implementation of coupling from the past. Extending
previous notation, a transition rule will be a map $f : X \times \Theta \to X$, with
some set $\Theta$ to be specified. Let now $V_i$, $i \in \mathbb{Z}$, be independent identically
distributed random variables taking values in $\Theta$. Then $\varphi_i = f(\cdot, V_i)$, $i \in \mathbb{Z}$,
is a stochastic flow. If, moreover, $\mathbb{P}(f(x, V_i) = y) = P(x,y)$ then the flow
fulfills Condition (P). The remaining problem is to construct a transition rule
such that the associated flow fulfills Condition (C) too.

Example 11 Recall from Example 1 how a Markov chain was realized there.
Let again $f(x,u)$ be a deterministic transition rule taking values in $X$, such that
for a random variable $U$ with uniform distribution on $\Theta = [0,1]$ the variable
$f(x,U)$ has law $P(x, \cdot)$. This way we, theoretically, may for an $m < 0$
realize all values $\varphi_m^0(x)$, $x \in X$, and check coalescence. If we go back $k$ more
steps in time we need all $\varphi_m^0 \circ \varphi_{m-k}^{m-1}(x)$. Since the maps $\varphi_0, \ldots, \varphi_m$ are kept,
we must work with the same random numbers $u_0, \ldots, u_m$, i.e. realizations
of the $U_0, \ldots, U_m$, as in the preceding run, and only independently generate
additional random numbers $u_{m-1}, \ldots, u_{m-k}$. For this special coupling there is
complete coalescence at time 0 in finite time. The strength of coupling depends
on the special form of $f$ which in turn depends on the concrete implementation.
In Example 1, for each $x \in X$, we partitioned $[0,1]$ into intervals $I_y^x$ of length
$P(x,y)$ and in step $n$ took that $y$ with $U_n \in I_y^x$. The intervals $I_{y^*}^x$, $x \in X$, with left end
at 0 have an intersection $I_{y^*}$ of length at least $\min_x P(x, y^*)$.

This simultaneously is the probability that $U$ falls into $I_{y^*}$ and all states
coalesce in $y^*$ in one single step, irrespective of $x$. We may improve coupling
by a clever arrangement of the intervals. If we put the intervals $I_y^x$, for which
$\min\{|I_y^x| : x \in X\}$ is maximal, to the left end of $[0,1]$ then we get the
lower bound $\max_y \min_x P(x,y)$ for the coalescence probability. We can improve
coupling even further, splitting the intervals into pieces of length
$\min\{|I_y^x| : x \in X\}$ and their rest, and arrange the equal pieces on the left
of $[0,1]$. This gives a bound $\sum_y \min_x P(x,y)$.
Note that although all these procedures realize the same Markov kernel $P$ they
correspond to different transition rules, to different stochastic flows, and to
different couplings. Apart from all these modifications, we can summarize:

PROPOSITION 9 Suppose that $P > 0$. Then all stochastic flows $\varphi_i = f(\cdot, U_i)$
from the present Example 11 fulfill Condition (C).

Note that the distribution of all these random maps definitely is not the
synchronous one from Example 2. For this distribution, set $\Theta = [0,1]^X$, use
independent copies $U_k^z$, $z \in X$, of $U_k$, and let $\varphi_k(x) = f(x, (U_k^z)_{z \in X}) = g(x, U_k^x)$
for $g$ on $X \times [0,1]$ constructed like above. Condition (C) is obviously fulfilled
and coupling from the past works also for this method.
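
The procedure of Example 11 can be sketched in Python as follows. The sketch uses the inverse-CDF transition rule of Example 1, goes back in time by doubling the horizon, and reuses the random numbers already drawn for times closer to 0. States are encoded as 0, ..., N-1; the strictly positive, doubly stochastic kernel `P` (whose invariant distribution is uniform), the seed and the function names are assumptions made for illustration.

```python
import random
from itertools import accumulate

def transition_rule(P, x, u):
    """Inverse-CDF rule f(x, u) as in Example 1 (states are 0, ..., N-1)."""
    for t, cum in enumerate(accumulate(P[x])):
        if u <= cum:
            return t
    return len(P[x]) - 1  # round-off guard

def cftp(P, rng):
    """Coupling from the past: return W_0, reusing u_0, u_{-1}, ... across restarts."""
    n_states = len(P)
    us = []          # us[j] is the random number driving phi_{-j}
    horizon = 1      # chains start at time -(horizon - 1); doubled until coalescence
    while True:
        while len(us) < horizon:
            us.append(1.0 - rng.random())   # extend further into the past only
        states = list(range(n_states))      # one chain per initial state
        for j in reversed(range(horizon)):  # apply phi_{-(horizon-1)}, ..., phi_0
            states = [transition_rule(P, x, us[j]) for x in states]
        if len(set(states)) == 1:           # complete coalescence at time 0
            return states[0]
        horizon *= 2

# Hypothetical strictly positive kernel; columns sum to 1, so mu is uniform.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]
rng = random.Random(42)
n = 20_000
counts = [0, 0, 0]
for _ in range(n):
    counts[cftp(P, rng)] += 1
print([c / n for c in counts])  # should be close to the invariant distribution
```

The essential point is that the list `us` is only ever extended: redrawing the numbers for times near 0 on each restart would in general destroy exactness.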

REMARK In Example 11 we found several lower bounds for the probability
that states coalesce in one step. An upper bound is given by

$$\mathbb{P}(\varphi(x) = \varphi(y)) = \sum_z \mathbb{P}(\varphi(x) = z,\ \varphi(y) = z) \le \sum_z \mathbb{P}(\varphi(x) = z) \wedge \mathbb{P}(\varphi(y) = z) = \sum_z P(x,z) \wedge P(y,z).$$

This is closely related to DOBRUSHIN's contraction technique, which in the
finite case is based on Dobrushin's contraction coefficient $c(P) = 1 - \sum_z P(x,z) \wedge
P(y,z)$, cf. [G. WINKLER, 1995], Chapter 4. The relation is

$$\mathbb{P}(\varphi(x) = \varphi(y)) \le 1 - c(P).$$

This upper bound is not sharp.
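
The bounds discussed in Example 11 and in the preceding remark are simple functions of the kernel. The following Python sketch computes the three lower bounds for complete coalescence in one step and the pairwise upper bound $\sum_z P(x,z) \wedge P(y,z)$ for each pair of distinct states. The kernel `P` and the function name are illustrative assumptions, and the first state in the enumeration plays the role of the state whose intervals start at 0.

```python
def coalescence_bounds(P):
    """One-step coalescence bounds for a kernel P given as a list of rows."""
    n = len(P)
    col_min = [min(P[x][y] for x in range(n)) for y in range(n)]
    lower_first = col_min[0]     # intervals of the first state aligned at 0
    lower_best = max(col_min)    # best single state at the left end: max_y min_x P(x, y)
    lower_split = sum(col_min)   # equal pieces stacked on the left: sum_y min_x P(x, y)
    # Pairwise upper bound sum_z min(P(x, z), P(y, z)) for x != y.
    upper = {
        (x, y): sum(min(P[x][z], P[y][z]) for z in range(n))
        for x in range(n) for y in range(n) if x < y
    }
    return lower_first, lower_best, lower_split, upper

# Hypothetical strictly positive kernel on three states.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]
lower_first, lower_best, lower_split, upper = coalescence_bounds(P)
print(lower_first, lower_best, lower_split, upper)
```

Note that the lower bounds refer to all states coalescing at once, while the upper bound concerns a single pair, so the former can never exceed the latter.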

6.3 Monotonicity 

Checking directly whether there is complete coalescence at time 0 starting at
more and more remote past times and at all possible states is time consuming,
and even impossible if the state space is large (as it is in the applications we
have in mind). If coalescence of very few states enforces coalescence of all
other states then the procedure becomes feasible. One of the concepts to make
this precise is monotonicity. We are now going to introduce this concept on an
elementary level.

DEFINITION 12 A partial order on a set $X$ is a relation $x \preceq y$ between
elements $x, y \in X$ with the two properties

(i) $x \preceq x$ for each $x \in X$ (reflexivity)

(ii) $x \preceq y$ and $y \preceq z$ implies $x \preceq z$ (transitivity).

Recall that a total order requires the additional condition that any two elements
$x, y \in X$ are comparable, i.e. $x \preceq y$ or $y \preceq x$.

EXAMPLE 13 (a) The usual relation $x \le y$ on $\mathbb{R}$ is a total order. In the
component-wise order on $\mathbb{R}^d$, $(x_1, \ldots, x_d) \preceq (y_1, \ldots, y_d)$ if and only if
$x_i \le y_i$ for each $i$. It is a partial but no total order since elements like $(0,1)$
and $(1,0)$ are not related. (b) If $X = \{\pm 1\}^S$, then in the component-wise order
from (a), the constant configurations $b \equiv 1$ and $w \equiv -1$ are maximal and
minimal, respectively, i.e. $x \preceq b$ and $w \preceq x$ for every $x \in X$. This will be
exploited in exact sampling for the Ising field in Section 6.4.

Next we want to lift partial orderings to the level of probability distributions.
Call a subset $I$ of $X$ an order ideal if $x \in I$ and $y \preceq x$ imply $y \in I$.

Example 14 (a) The order ideals in $\mathbb{R}$ with the usual order are the rays
$(-\infty, u]$ and $(-\infty, u)$, $u \in \mathbb{R}$.

(b) In the binary setting of Example 13(b), $x \preceq y$ if each black pixel of $x$ is
also black in $y$ (if we agree that $x_s = +1$ means that the colour of pixel $s$ is
black). The order ideals are of the form $\{x \in X : x \preceq y\}$.

DEFINITION 15 Let $(X, \preceq)$ be a finite partially ordered set, and let $\nu$ and $\mu$
be probability distributions on $X$. Then $\nu \preceq \mu$ in stochastic order, if and only
if $\nu(I) \ge \mu(I)$ for each order ideal $I$.

EXAMPLE 16 Let $\nu$ and $\mu$ be distributions on $\mathbb{R}$ with cumulative distribution
functions $F_\nu$ and $F_\mu$, respectively. Then $\nu \preceq \mu$ if and only if
$\mu((-\infty, u]) \le \nu((-\infty, u])$ if and only if $F_\mu(u) \le F_\nu(u)$ for every $u \in \mathbb{R}$.




This means that 'the mass of $\nu$ is more on the left than the mass of $\mu$'. For
Dirac distributions $\varepsilon_u \preceq \varepsilon_v$ if and only if $u \le v$.

The natural extension to Markov kernels reads

DEFINITION 17 We call a Markov kernel $P$ on a partially ordered space
$(X, \preceq)$ stochastically monotone, if and only if $P(x, \cdot) \preceq P(y, \cdot)$ whenever
$x \preceq y$.

In Example 11 we constructed transition rules $f$ for homogeneous Markov
chains, or rather Markov kernels $P$. A transition rule is called monotone if
$f(x,u) \preceq f(y,u)$ for each $u$ whenever $x \preceq y$. Plainly, a monotone transition
rule induces a monotone Markov kernel. Conversely, a monotone kernel is
not necessarily induced by a monotone transition rule, even in very simple
situations. [D.A. Ross, 1993], see [J.A. FILL & M. MACHIDA, 2001],
p. 2, gives a simple counterexample:

Example 18 Consider the space $X = \{u, v, a, b\}$ and let $u \preceq a, b$, and
$a, b \preceq v$. Define a Markov kernel $P$ by

$$P(u,u) = 1/2 = P(u,a), \qquad P(a,u) = 1/2 = P(a,v),$$
$$P(b,a) = 1/2 = P(b,b), \qquad P(v,a) = 1/2 = P(v,v).$$

The order ideals are $\emptyset$, $\{u\}$, $\{a,u\}$, $\{b,u\}$, $\{a,b,u\}$ and $X$, and
it is readily checked that $P$ is monotone. Suppose now
that there are random variables with $\xi_u \preceq \xi_a, \xi_b \preceq \xi_v$
almost surely and with laws $P(u, \cdot)$, $P(a, \cdot)$, $P(b, \cdot)$, and $P(v, \cdot)$, respectively.
We shall argue that

$$\mathbb{P}(\xi_u = a) = \mathbb{P}(\xi_u = a,\ \xi_a = v,\ \xi_b = a,\ \xi_v = v) = 1/2,$$
$$\mathbb{P}(\xi_b = b) = \mathbb{P}(\xi_u = u,\ \xi_b = b,\ \xi_v = v) = 1/2.$$

The two events are disjoint and hence $\mathbb{P}(\xi_v = v) = 1$ in contradiction to
$\mathbb{P}(\xi_v = v) = 1/2$. We finally indicate how for example the first identity can be
verified: Since $\xi_u \preceq \xi_a$ one has $\mathbb{P}(\xi_u = a) = \mathbb{P}(\xi_u = a,\ \xi_a \in \{a, v\})$. Since
$\mathbb{P}(\xi_a = a) = 0$, we conclude $\mathbb{P}(\xi_u = a) = \mathbb{P}(\xi_u = a,\ \xi_a = v)$. Now repeat
this argument two times.
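
The monotonicity of the kernel in Example 18 can be verified mechanically. The Python sketch below enumerates the order ideals of the four-point space and checks $P(x, I) \ge P(y, I)$ for every ideal $I$ and every pair $x \preceq y$. It assumes the reading $P(v,a) = 1/2 = P(v,v)$ for the last row of the kernel; the encoding of the order through the dictionary `BELOW` and the function names are illustrative choices. The enumeration yields six order ideals, namely $\emptyset$, $\{u\}$, $\{a,u\}$, $\{b,u\}$, $\{a,b,u\}$ and $X$.

```python
from fractions import Fraction
from itertools import combinations

STATES = ["u", "a", "b", "v"]
# Elements strictly below each state: u < a, u < b, a < v, b < v (and u < v).
BELOW = {"u": set(), "a": {"u"}, "b": {"u"}, "v": {"u", "a", "b"}}
HALF = Fraction(1, 2)
P = {
    "u": {"u": HALF, "a": HALF},
    "a": {"u": HALF, "v": HALF},
    "b": {"a": HALF, "b": HALF},
    "v": {"a": HALF, "v": HALF},  # assumed reading of the last row
}

def order_ideals():
    """All subsets I such that x in I and y below x imply y in I."""
    ideals = []
    for r in range(len(STATES) + 1):
        for subset in combinations(STATES, r):
            s = set(subset)
            if all(BELOW[x] <= s for x in s):
                ideals.append(s)
    return ideals

def is_monotone():
    """Check P(x, I) >= P(y, I) for every order ideal I whenever x is below y."""
    for ideal in order_ideals():
        mass = {x: sum(p for z, p in P[x].items() if z in ideal) for x in STATES}
        for y in STATES:
            if any(mass[x] < mass[y] for x in BELOW[y]):
                return False
    return True

print(len(order_ideals()), is_monotone())
```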

Suppose now that the partially ordered space $(X, \preceq)$ contains a minimal
element $u$ and a maximal element $v$, i.e. $u \preceq x \preceq v$ for every $x \in X$. Suppose
further that the stochastic flow is induced by a monotone transition rule, i.e.
$\varphi_i(x) = f(x, U_i)$ and $f(x,u) \preceq f(y,u)$ if $x \preceq y$. Then

$$\varphi_m^n(u) \preceq \varphi_m^n(x) \preceq \varphi_m^n(v) \quad \text{for every } x \in X,\ m \le n,$$

and $\varphi_m^0(x) = w$, $m \le 0$, for each $x \in X$, as soon as $\varphi_m^0(u) = w = \varphi_m^0(v)$.
The previous findings can be turned into practicable algorithms.

PROPOSITION 10 Suppose that $P$ is monotone and $(X, \preceq)$ has a minimum $u$
and maximum $v$. Then coalescence for $u$ and $v$ enforces complete coalescence.

6.4 Random Fields and the Ising Model 

Random fields serve as flexible models in image analysis and spatial statis- 
tics. In particular, any full probabilistic model of textures with random fluc- 
tuations necessarily is a random field. Recursive (auto-associative) neural net- 
works can be reinterpreted in this framework as well, cf. e.g. [G. Winkler, 
1995]. To understand the phenomenology of these models, sampling from their 
Gibbs distribution provides an important tool. In the sequel we want to show 
how the concepts developed above serve to establish exact sampling from the 
Gibbs distribution of a well known random field - the Ising model. 

Let a pattern or configuration be represented by an array $x = (x_s)_{s \in S}$ of
'intensities' $x_s \in G_s$ in 'pixels' or 'sites' $s \in S$ with finite sets $G_s$ and $S$. $S$
might be a finite square grid or, in case of neural networks, an undirected
finite graph. A (finite) random field is a strictly positive probability measure $\Pi$
on the space $X = \prod_{s \in S} G_s$ of all configurations $x$. Taking logarithms shows
that $\Pi$ is of the Gibbsian form

$$\Pi(x) = Z^{-1} \exp(-K(x)), \qquad Z = \sum_z \exp(-K(z)), \tag{4.1}$$

with a function $K$ on $X$. It is called a Gibbs field with energy function $K$ and
partition function $Z$. These names are reminiscent of their roots in statistical physics.
For convenience we restrict ourselves to the Gibbs sampler with random
visiting scheme. Otherwise we would have to modify slightly the setup of Section 6.2.
Let $\mathrm{pr}_t$ be the projection $X \to G_t$, $x \mapsto x_t$. For a Gibbs field $\Pi$ let

$$\Pi(y_s \mid x_t,\ t \ne s) = \Pi(\mathrm{pr}_s = y_s \mid \mathrm{pr}_t = x_t,\ t \ne s) \tag{4.2}$$

denote the single-site conditional probabilities. The Gibbs sampler with random
visiting scheme first picks a site $s \in S$ at random from a probability
distribution $D$ on $S$, and then picks an intensity at random from the conditional
distribution (4.2) on $G_s$. Given a configuration $x = (x_t)$ this results in a new
configuration $y = (y_t)$ which equals $x$ everywhere except possibly at site $s$.
The procedure is repeated with the new configuration $y$, and so on.
This defines a homogeneous Markov chain on $X$ with Markov kernel

$$P(x,y) = \sum_{s \in S} D(s)\, \Pi_{\{s\}}(x,y), \quad x, y \in X, \tag{4.3}$$

where $\Pi_{\{s\}}(x,y) = \Pi(y_s \mid x_t,\ t \ne s)$ if $x$ and $y$ are equal off $s$ and
$\Pi_{\{s\}}(x,y) = 0$ otherwise. These transition probabilities $\Pi_{\{s\}}$ are called the
local characteristics. $D$ is called the proposal or exploration distribution.

We assume that $D$ is strictly positive; frequently it is the uniform distribution
on $S$. Then $P$ is primitive since $P^{|S|}$ is strictly positive. In fact, in
each step each site and each intensity in the site has positive probability to be
chosen, and thus each $y$ can be reached from each $x$ in $|S|$ steps with positive
probability. It is easily checked, by verifying the detailed balance equations, that
$\Pi$ is the invariant distribution of $P$, and thus the invariant distribution of the
homogeneous Markov chain generated by $P$.

Example 19 (The Ising model) Let us give an example for exact sampling
by way of the Ising model. The ferromagnetic Ising model with magnetic
field $h := (h_s)_{s \in S}$ is a binary random field with $G_s = \{-1, 1\}$ and energy
function

$$K(x) = -\beta \sum_{s \sim t} x_s x_t - \sum_s h_s x_s,$$

where $\beta > 0$, $h_s \in \mathbb{R}$ and $s \sim t$ indicates that $s$ and $t$ are neighbours. For
the random visiting scheme in (4.3) the Markov chain is homogeneous and fits
perfectly into the setting of Section 6.2. The formula from [G. WINKLER,
1995], Proposition 3.2.1 (see also [G. WINKLER, 1995], Example 3.1.1) for
the local characteristics boils down to

$$p^+(x) = \Pi(X_s = 1 \mid X_t = x_t,\ t \ne s) = \Bigl(1 + \exp\Bigl(-2\beta \sum_{t \sim s} x_t - 2h_s\Bigr)\Bigr)^{-1}.$$

This probability increases with the set $\{t \in S : x_t = 1\}$. Hence
$p^+(y) \ge p^+(x)$ if $x \preceq y$ in the component-wise partial order introduced in
Example 13. The updates $x'$ and $y'$ preserve all the black sites off $s$, and possibly
create an additional black one at $s$. We conclude that $P$ from (4.3) is
monotone and fulfills the hypotheses of Proposition 10. Hence for complete
coalescence one only has to check whether the completely black and the completely
white patterns coalesce. For transition rules like in Example 11 the
Condition (C) from Section 6.2 is also fulfilled and coupling from the past works.
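
A compact Python sketch of monotone coupling from the past for the Ising Gibbs sampler with random visiting scheme is given below. It is written under several assumptions that are not fixed by the text: a rectangular grid with 4-nearest-neighbour structure and free boundary, a uniform proposal distribution $D$, a constant magnetic field `h`, the energy $K(x) = -\beta \sum_{s \sim t} x_s x_t - \sum_s h_s x_s$, and the update rule "set $x_s = +1$ if and only if $u \le p^+(x)$", which is monotone because $p^+$ increases with $x$. All function names are illustrative.

```python
import math
import random

def neighbours(s, rows, cols):
    """4-nearest neighbours of site s = (i, j) on a rows x cols grid (free boundary)."""
    i, j = s
    cand = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(a, b) for a, b in cand if 0 <= a < rows and 0 <= b < cols]

def update(x, s, u, beta, h, rows, cols):
    """Monotone Gibbs update f(x, (s, u)): set x_s = +1 iff u <= p^+(x)."""
    field = sum(x[t] for t in neighbours(s, rows, cols))
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * beta * field - 2.0 * h))
    x[s] = 1 if u <= p_plus else -1

def ising_cftp(rows, cols, beta, h, rng):
    """Monotone coupling from the past: follow only the all-white and all-black chains."""
    sites = [(i, j) for i in range(rows) for j in range(cols)]
    moves = []       # moves[k] = (site, u) driving phi_{-k}; kept across restarts
    horizon = 1
    while True:
        while len(moves) < horizon:
            moves.append((rng.choice(sites), rng.random()))
        low = {s: -1 for s in sites}    # minimal configuration w
        high = {s: +1 for s in sites}   # maximal configuration b
        for k in reversed(range(horizon)):
            s, u = moves[k]
            update(low, s, u, beta, h, rows, cols)
            update(high, s, u, beta, h, rows, cols)
        if low == high:                 # the extremal chains have coalesced
            return low
        horizon *= 2

sample = ising_cftp(rows=4, cols=4, beta=0.3, h=0.0, rng=random.Random(7))
for i in range(4):
    print("".join("#" if sample[(i, j)] == 1 else "." for j in range(4)))
```

Only two configurations are tracked, regardless of the $2^{|S|}$ possible starting states; this is the practical gain of Proposition 10. For larger $\beta$ the coalescence time grows quickly, so the small grid and the moderate value of `beta` here are deliberate.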

6.5 Conclusion 

The authors are not aware of any other mathematical field in which so many
insufficient arguments, ranging from incomplete or misleading to completely
wrong, have been published (mainly on the Internet). In particular,
Condition (C), or a substitute for it, is missing in many presently available
texts. A rigorous treatment is [S.G. Foss & R.L. Tweedie, 1998]. These
authors do not use iterated random maps. These are exploited systematically in
[P. Diaconis & D. Freedman, 1999]. [J.A. Fill, 1998] introduces
'interruptible' perfect sampling based on acceptance/rejection sampling.
Meanwhile there is a body of papers on exact sampling. On the other hand, the
field is still in a state of flux and hence it does not make sense to give
further references; a rich and up-to-date source is the home page of
D.B. WILSON, http://www.dbwilson.com/exact/. The connection between
transition probabilities and random maps was clarified in
[H. v. WEIZSÄCKER, 1974].
Abstract The purpose of this paper is to review the different generalizations
and extensions of the ergodic theorem of information theory in terms of
reference measure, state space, index set and required properties (ergodicity,
stationarity, etc.) of the process, from the original
Shannon-McMillan-Breiman version to its latest developments.

7.1 Introduction

The statement of convergence of the entropy at time n of a random process
divided by n to a constant limit, called the entropy rate of the process, is
known as the ergodic theorem of information theory or asymptotic
equirepartition property (AEP). Its original version, proven in the 1950s for
ergodic stationary processes with a finite state space, is known as the
Shannon-McMillan theorem for the convergence in mean and as the
Shannon-McMillan-Breiman theorem for the almost sure convergence. Since then,
numerous extensions have been made in the direction of weakening the
hypotheses on the reference measure (from the counting or product measure to
Markovian or semi-Markovian measures), state space (from a finite set to any
Borel set), index set (from discrete-time to continuous-time, product sets or
groups) and required properties (ergodicity, stationarity, etc.) of the
process.

The purpose of this paper is to review these different generalizations and
extensions. Some necessary basics are given in Section 7.2 concerning the
definition of entropy, ergodicity, stationarity and Markovian measures and
processes. A general statement and applications of the AEP are given too. The
original AEP with hints of proof is presented in Section 7.3.1. Extensions in
terms of state space are considered in Section 7.3.2, in terms of measures in
Section 7.3.3 and to continuous-time processes in Section 7.3.4. Finally,
explicit expressions of the entropy rate for Markovian and Gaussian processes
are given in Section 7.4.

7.2 Basics 

7.2.1 Definition of entropy 

The concept of entropy is the basis of information theory. It was first
introduced in the field of probability by Boltzmann in the XIX-th century in
statistical mechanics and then by Shannon (1948) for studying communication
systems.

DEFINITION 1 The entropy of a probability distribution P with density p with
respect to a reference measure $\mu$ is defined as Boltzmann's H-function,
that is to say

$$\mathbb{S}(P) = -\int p(x) \log p(x)\, d\mu(x) = H_p,$$

with the convention $0 \log 0 = 0$.

It inspired Shannon (1948) to define and study the entropy of a discrete
distribution taking n values as

$$\mathbb{S}(p) = -\sum_{i=1}^{n} p_i \log p_i.$$

The function $\mathbb{S}$ has interesting properties as a measure of
uncertainty in communication theory, see Reza (1961), Ash (1965), Cover &
Thomas (1991). Actually, in the continuous case, these properties cannot be
derived from the discrete case. For example, the entropy of the uniform
distribution U on an interval [a, b] equals $\log(b - a)$ but the entropy of
the uniform distribution U on a partition in n values of the same interval
equals $\log n$. The link between these two separate notions was made by
Kullback & Leibler (1951), see also Kullback (1978).
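
The contrast between the two uniform cases can be checked numerically. The
sketch below is illustrative only; the function names and the chosen values
(n = 8, interval [0, 0.5]) are assumptions, and natural logarithms are used.

```python
import math

def discrete_entropy(p):
    """-sum p_i log p_i, with the convention 0 log 0 = 0 (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def uniform_differential_entropy(a, b):
    """Entropy of the uniform density on [a, b] w.r.t. Lebesgue measure."""
    return math.log(b - a)

if __name__ == "__main__":
    n, a, b = 8, 0.0, 0.5
    # Discrete uniform on n values: log n, whatever the interval is.
    print(discrete_entropy([1.0 / n] * n), math.log(n))
    # Continuous uniform on [a, b]: log(b - a), negative when b - a < 1.
    print(uniform_differential_entropy(a, b))
```

The discrete value depends only on n, while the continuous one depends only
on the length of the interval and can be negative.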

DEFINITION 2 Let P and Q be two distributions on the same measurable
space. The Kullback-Leibler information of P relative to Q is defined as

$$\mathbb{S}(P \mid Q) = \sum_i p_i \log \frac{p_i}{q_i},$$

for discrete distributions and as

$$\mathbb{S}(P \mid Q) = \mathbb{E}_P\Bigl(\log \frac{dP}{dQ}\Bigr) = \int \log \frac{dP}{dQ}(x)\, dP(x),$$

if P is absolutely continuous with respect to Q (and as $+\infty$ if not).

The definition can be extended to two positive measures on the same measur- 
able space. 

In both discrete and absolutely continuous cases, we have

$$\mathbb{S}(P \mid Q) = \sup \sum_i P(A_i) \log \frac{P(A_i)}{Q(A_i)},$$

where the supremum is taken on all finite partitions $(A_i)$ of the space, and

$$\mathbb{S}(P) = \mathbb{S}(U) - \mathbb{S}(P \mid U).$$
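
In the discrete case the last identity is a one-line computation, since
$\log n - \sum_i p_i \log(n p_i) = -\sum_i p_i \log p_i$. The following sketch
checks it numerically; the function names and the example distribution are
assumptions chosen for illustration.

```python
import math

def entropy(p):
    """Discrete entropy with the convention 0 log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kullback_leibler(p, q):
    """Kullback-Leibler information of p relative to q (+inf if p is not
    absolutely continuous with respect to q)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

if __name__ == "__main__":
    p = [0.5, 0.25, 0.125, 0.125]
    u = [0.25] * 4
    print(entropy(p), entropy(u) - kullback_leibler(p, u))
```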

Entropy is thus meaningful in information theory as well as in statistical
mechanics. In the former, it measures the variation of information from
the uniform distribution to P, hence has a meaning as a measure of uncertainty
of the system. In the latter, a system is in equilibrium if the probability
density (or number of particles in an infinitesimal volume) is close to the
uniform repartition.

The entropy methods can also be justified by purely probabilistic or
statistical arguments (large deviations principle, Bayesian statistics,
properties of the induced estimates, etc.), see Csizar (1996), Garret (2001),
Grendar & Grendar (2001), and particularly for Markov chains, Moran (1961).

Basic properties of entropy and links with communication theory are given 
in Girardin & Limnios (2001a). For a detailed study, see Reza (1961), Ash 
(1965), Guiasu (1977), Cover & Thomas (1991). 

DEFINITION 3 The entropy at time n of a discrete-time stochastic process
$X = (X_n)_{n \in \mathbb{N}}$ taking values in $(E, \mathcal{E})$ is by
definition the entropy of its n-dimensional marginal distribution, namely

$$\mathbb{H}_n(X) = -\mathbb{E}[\log p_n^X(X_1, \ldots, X_n)]
= -\int p_n^X(x_1, \ldots, x_n) \log p_n^X(x_1, \ldots, x_n)\, d\mu_n(x_1, \ldots, x_n),$$

where $p_n^X$ is the density of the random vector $(X_1, \ldots, X_n)$ with
respect to the n-th marginal $\mu_n$ of a reference measure $\mu$ on the
infinite product space $(E^{\mathbb{N}}, \mathcal{E}^{\mathbb{N}})$.


The entropy at time n is a nondecreasing nonnegative function of n. It can also
be seen as the Kullback information $\mathbb{H}_n(X \mid Y)$ of the marginal
distribution of X relative to the marginal distribution $\mu_n$ of a process Y,
also called relative entropy between X and Y. For this point of view, see
especially Pinsker (1960) and Perez (1964).

Under suitable conditions, the entropy at time n divided by n converges,

$$\frac{\mathbb{H}_n(X)}{n} \longrightarrow \mathbb{H}(X), \qquad n \to +\infty. \tag{2.1}$$



If the limit $\mathbb{H}(X)$ (or $\mathbb{H}(X \mid Y)$) exists and is finite,
it is called the Shannon entropy rate of the process and we have
$\mathbb{H}(X) = \inf_n \mathbb{H}_n(X)/n$.

For simplification, let us set $x_m^n = (x_m, \ldots, x_n)$. Set
$h_n(x) = -\log p_n(x_1^n)$ and let $g_{m,n}(x) = p_n(x_m^{n})/p_n(x_m^{n-1})$
be the conditional density of $X_n$ relative to $(X_m, \ldots, X_{n-1})$. If
$-\mathbb{E}\log g_{0,n}(X)$ converges to some limit, then
$\mathbb{H}_n(X)/n$ is the Cesàro mean of the sequence and hence converges to
the same limit. The entropy rate is sometimes defined in this way, see for
example Reza (1961).

The convergence in (2.1) appears as the consequence of the convergence in
mean of the sequence of random variables $(-\log p_n(X_1, \ldots, X_n)/n)$. The
almost sure convergence is also of interest. Together they constitute the
ergodic theorem of information theory, also called the
Shannon-McMillan-Breiman theorem or Asymptotic Equirepartition Property.

Theorem 1 (Ergodic theorem of information theory)
Under suitable conditions on the process X, its index-space T, its state space
E and the reference measure $\mu$, the sequence
$(-\log p_n(X_1, \ldots, X_n)/n)$ converges in mean or almost surely to the
entropy rate of the process.

In the following, for simplification, we will call mean AEP the convergence 
in mean and strong AEP the almost sure convergence. 

Similarly, for continuous-time processes, we get the following definition. 

DEFINITION 4 The entropy at time T of a continuous-time process
$X = (X_t)_{t \in \mathbb{R}_+}$ is defined as

$$\mathbb{H}_T(X) = -\int p_T(x) \log p_T(x)\, d\mu_T(x),$$

where $p_T(x)$ is the likelihood of $(X_t)_{0 \le t \le T}$ with respect to
the restriction $\mu_T$ to [0, T] of some reference measure $\mu$.

The definition of the entropy rate and the statement of the corresponding AEP 
derive immediately. 

This theorem has many applications. Let us list some of them.
First of all, the application which made McMillan call it AEP. The typical
set of a process is defined as the set of sequences $(x_1, \ldots, x_n)$ such
that

$$2^{-n(\mathbb{H}(X)+\varepsilon)} \le p_n(x_1^n) \le 2^{-n(\mathbb{H}(X)-\varepsilon)},$$

and a sequence $(X_1, \ldots, X_n)$ is said to be typical if its density
satisfies the above relation. From the mean AEP for a finite state space, the
probability of the typical set is proven to be nearly one; all its elements are
nearly equiprobable and this set contains nearly $2^{n\mathbb{H}(X)}$
elements, see Cover & Thomas (1991) (with application to data compression).
For a Borel space, the distribution of $(X_1, \ldots, X_n)$ is proven to be
asymptotically uniform on the typical set, which has the least asymptotic
volume (equal to $2^{n\mathbb{H}(X)}$) among sets of high probability. And
from the almost sure convergence, the sequences $(X_1, \ldots, X_n)$ are
proven to be almost surely typical for large n, see Barron (1985).
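
For an i.i.d. binary source the typical set can be enumerated exactly,
because the probability of a sequence depends only on its number of ones. The
sketch below is an illustration under assumptions that are not in the text: a
Bernoulli(p) source, base-2 logarithms, and the hypothetical helper names
`binary_entropy` and `typical_set_stats`.

```python
import math

def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli(p) variable, 0 < p < 1."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def typical_set_stats(n, p, eps):
    """Exact probability and cardinality of the typical set of length-n
    sequences from an i.i.d. Bernoulli(p) source."""
    H = binary_entropy(p)
    prob, size = 0.0, 0
    for k in range(n + 1):                      # k = number of ones
        log2_seq = k * math.log2(p) + (n - k) * math.log2(1 - p)
        if -n * (H + eps) <= log2_seq <= -n * (H - eps):
            count = math.comb(n, k)             # sequences with k ones
            size += count
            prob += count * 2.0 ** log2_seq
    return prob, size

if __name__ == "__main__":
    n, p, eps = 500, 0.3, 0.1
    prob, size = typical_set_stats(n, p, eps)
    print(prob)                                  # close to one
    print(math.log2(size) / n, binary_entropy(p))
```

The second printed line compares the exponential growth rate of the typical
set with the entropy; they agree up to the tolerance eps.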

The AEP has thus a prominent role in information theory together with the 
linked Shannon channel coding theorem. This theory has been presented in 
many books since the original exposition of Shannon (1948), see for example 
Khinchin (1957), Feinstein (1958), Gallager (1968) and more recently Guiasu 
(1977), Cover & Thomas (1991). 

It also plays a role in finance, see for example Algoet & Cover (1988b) and
Algoet (1994) and the references therein.

Many applications of the maximum entropy methods involving the entropy
rate exist in the literature, see, e.g., Girardin (2002) and the references
therein for applications involving Markov chains and processes.

Applications to statistical inference follow too, for example through
likelihood maximization, which is often equivalent to maximization of the
entropy rate.

Large deviations results follow too. See Gallager (1968) or Cover & Thomas
(1991) for results in information theory, and Ellis (1985) for a statistical
mechanics point of view.

Linnik (1959) initiated the use of entropy for proving limit theorems in a
proof of the central limit theorem. For other examples and recent
developments, see Johnson (1999) and the references therein.

7.2.2 Ergodicity, stationarity and Markov properties 

The AEP involves the notions of ergodicity and stationarity. Let us recall
their definitions.

DEFINITION 5 Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space.
A process $(X_n)$ taking values in E can be defined as
$X_n(\omega) = X(S^n \omega)$, where X is a random variable with values in E
and S is a shift from $\Omega$ to itself.
The shift (and thus the process) is said to be

stationary if $\mathbb{P}(S^{-1}A) = \mathbb{P}(A)$ for all $A \in \mathcal{A}$;
ergodic if $S^{-1}A = A$ implies $\mathbb{P}(A) = 0$ or 1.
For an ergodic shift, the strong law of large numbers takes the form

$$\frac{1}{n} \sum_{k=0}^{n-1} \mathbf{1}_A(S^k \omega) \longrightarrow \mathbb{P}(A), \qquad A \in \mathcal{A},$$

and implies the following Birkhoff (or individual) ergodic theorem, see Doob
(1953).

THEOREM 2 If S is an ergodic shift of $(\Omega, \mathcal{A}, \mathbb{P})$ and
F is an integrable function on $\Omega$, then

$$\frac{1}{n} \sum_{k=0}^{n-1} F(S^k \omega) \longrightarrow \mathbb{E}F, \quad \text{a.s. and in } L^1.$$

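
A standard illustration of the individual ergodic theorem, not taken from the
text, is the rotation of the unit interval by an irrational angle, which is an
ergodic shift for Lebesgue measure. The sketch below assumes this example, with
F the indicator of [0, 1/2), so that the time averages should approach
$\mathbb{E}F = 1/2$; the names `ergodic_average`, `shift` and `alpha` are
hypothetical.

```python
import math

def ergodic_average(F, shift, omega, n):
    """Time average (1/n) * sum_{k<n} F(S^k omega) along one orbit."""
    total = 0.0
    for _ in range(n):
        total += F(omega)
        omega = shift(omega)
    return total / n

alpha = math.sqrt(2.0) - 1.0             # irrational rotation angle

def shift(w):
    """Rotation of [0, 1) by alpha."""
    return (w + alpha) % 1.0

def F(w):
    """Indicator function of [0, 1/2)."""
    return 1.0 if w < 0.5 else 0.0

if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        print(n, ergodic_average(F, shift, 0.1, n))
```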
The following extension is due to Breiman (1957). 

THEOREM 3 If $(F_k)$ is a uniformly $L^1$-bounded (i.e., such that
$\mathbb{E}\sup_k |F_k| < +\infty$) sequence of measurable functions
converging almost surely to some function F, then

$$\frac{1}{n} \sum_{k=0}^{n-1} F_k(S^k \omega) \longrightarrow \mathbb{E}F, \quad \text{a.s.}$$



Stationarity and ergodicity can equivalently be defined by considering the
state probability space, i.e., $E^{\mathbb{N}}$ endowed with the
$\sigma$-algebra $\mathcal{E}^{\mathbb{N}}$ and the law of the process, say
$\mathbb{P}_X$, defined on
$(E^{\mathbb{N}}, \mathcal{F} = \mathcal{E}^{\mathbb{N}})$. The process is
stationary or ergodic if the translation shift $\theta$ defined by
$(\theta x)_n = x_{n+1}$ (where $x = (x_0, \ldots, x_n, \ldots)$) is so for
$\mathbb{P}_X$. The ergodic theorems then involve $f(\theta^k x)$ instead of
$F(S^k \omega)$ for any integrable function f defined on $E^{\mathbb{N}}$.

The same notions can be defined and considered for a continuous-time
process $X = (X_t)_{t \in T}$, with a group of shifts $(T^t, t \in T)$, or a
translation of the state space $E^T$.

The entropy rate can also be defined in terms of shifts of the finite measure
space $(\Omega, \mathcal{A}, \mathbb{P})$ as follows, see Billingsley (1978),
which gives many connections between information theory and ergodic theory, or
Guiasu (1977). The entropy of a finite $\sigma$-field
$\mathcal{B} \subset \mathcal{A}$ is defined as

$$h(\mathcal{B}) = -\sum_{B \in \mathcal{B}} \mathbb{P}(B) \log \mathbb{P}(B),$$

the entropy of $\mathcal{B}$ relative to a shift S is

$$h(S, \mathcal{B}) = \lim_{n \to +\infty} \frac{1}{n}\, h\Bigl(\bigvee_{k=0}^{n-1} S^{-k}\mathcal{B}\Bigr),$$

where $\mathcal{B} \vee \mathcal{B}'$ denotes the $\sigma$-field generated by
$\mathcal{B} \cup \mathcal{B}'$, and finally the entropy (rate) of S is

$$h(S) = \sup_{\mathcal{B} \subset \mathcal{A}} h(S, \mathcal{B}).$$

See Krengel (1967) for a generalization to $\sigma$-finite measure spaces.

Markovian measures and processes play a prominent part in entropy theory. 
Let us recall some definitions. 

The theory of Markov processes and its extensions was initiated by A.
Markov (1856-1922) through the property which bears his name: the future
of a Markov process depends on its past only through its present, see
Anderson (1991). Several generalizations have been proposed since, all of them
with the aim of weakening the Markov property, for example the semi-Markov
processes introduced by P. Levy (1954) and W. Smith (1955). The latter
generalize in a natural way the pure jump Markov processes and the renewal
processes. The future evolution of a semi-Markov process depends on its
present state and on the time elapsed since the latest transition, while the
evolution of a pure jump Markov process depends only on its present state, see
Limnios & Oprişan (2001).

DEFINITION 6 A probability measure $\mu$ on the product space
$(E^{\mathbb{N}}, \mathcal{F})$ is Markovian of order $r \in \mathbb{N}$ if

$$\mu(x_n \in F \mid x_m, \ldots, x_{n-1}) = \mu(x_n \in F \mid x_{n-r}, \ldots, x_{n-1}),
\qquad F \in \mathcal{E},\ m < n - r \in \mathbb{N}.$$

It is homogeneous (or has stationary transition probabilities) if

$$\mu(x_n \in F \mid x_{n-1}) = \theta^{n-1} \mu(x_1 \in F \mid x_0), \quad \text{a.s.},\ F \in \mathcal{E},\ n \in \mathbb{N}.$$

A Markovian measure of order one is just said to be Markovian. If the order
is zero, the measure is just the product of independent measures.

A process whose distribution is a Markovian measure is a Markov chain or
discrete Markov process,
$P = (\mathbb{P}(X_n = j \mid X_{n-1} = i))_{i,j \in E}$ is its transition
kernel and a probability $\pi$ such that $\pi P = \pi$ is its stationary
distribution. General continuous-time Markov processes are defined in a
similar way.

Continuous-time jump Markov processes can also be seen as special semi- 
Markov processes. The definition of semi-Markov processes is easier in terms 
of Markov renewal processes, here only with a countable state space. 

DEFINITION 7 Let $(\Omega, \mathcal{F}, \mathbb{P})$ denote a probability
space and let E be a finite or countable set. A process
$(J_n, S_n)_{n \ge 0}$ is a Markov renewal process with semi-Markov kernel
$Q(t) = (Q_{ij}(t);\ i, j \in E)$, for $t \ge 0$, if

$$\mathbb{P}(J_{n+1} = j,\ S_{n+1} - S_n \le t \mid J_1, \ldots, J_{n-1}, J_n = i, S_1, \ldots, S_n)
= \mathbb{P}(J_{n+1} = j,\ S_{n+1} - S_n \le t \mid J_n = i) = Q_{ij}(t).$$

The process $(J_n)$ is an E-valued Markov chain with transition kernel
$P = (P(i,j))_{i,j \in E}$, where $P(i,j) = \lim_{t \to +\infty} Q_{ij}(t)$,
see for example Limnios & Oprişan (2001). And the times
$0 = S_0 < \cdots < S_n < \cdots$ are the $\mathbb{R}_+$-valued jump times of
the corresponding E-valued semi-Markov process $X = (X_t)_{t \ge 0}$ defined by

$$X_t = J_n \quad \text{if } S_n \le t < S_{n+1}. \tag{2.2}$$
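
Relation (2.2) is a lookup in the sequence of jump times. A minimal sketch,
assuming the first jump times and visited states are stored in two lists (the
function name and the sample path are hypothetical):

```python
from bisect import bisect_right

def semi_markov_state(t, jump_times, states):
    """Return X_t = J_n where S_n <= t < S_{n+1}.

    jump_times = [S_0 = 0, S_1, ...] is increasing and states = [J_0, J_1, ...].
    For t beyond the last recorded jump time the last state is returned,
    which is only valid if no further jump occurred before t.
    """
    if t < jump_times[0]:
        raise ValueError("t must be at least S_0")
    n = bisect_right(jump_times, t) - 1   # largest n with S_n <= t
    return states[n]

if __name__ == "__main__":
    S = [0.0, 1.5, 2.0, 4.2]
    J = ["a", "b", "a", "c"]
    print([semi_markov_state(t, S, J) for t in (0.0, 1.49, 1.5, 3.0)])
```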

A jump Markov process is a semi-Markov process with semi-Markov kernel
$Q_{ij}(t) = a_{ij}(1 - e^{-a_i t})/a_i$, where
$a_i = -a_{ii} = \sum_{j \ne i} a_{ij}$. The matrix $A = (a_{ij})$ is
called its infinitesimal generator and a probability $\pi = (\pi_i)$ such that
$\pi A = 0$ is called its stationary distribution.

7.3 The theorem and its extensions 

First, let us see conditions for (2.1) to hold. If E is finite,
$\mathbb{E}[h_n(X)]/n$ is bounded and hence by Fatou's lemma
$h(X) = \lim h_n(X)/n$ is integrable. If $Y = \theta X$, then, by entropy
properties, $h_n(X) \le h_n(Y)$. If X is stationary, then
$\mathbb{E}h(X) = \mathbb{E}h(Y)$, so $h(X) = h(Y)$ and hence $h(X)$ is an
invariant finite random variable. Thus, the limit of $h_n(X)/n$ is a random
variable which is invariant by $\theta$; the ergodicity of the process ensures
that almost surely this entropy rate is constant. If E is not finite,
finiteness of $\mathbb{H}(X)$ (or equivalently upper-boundedness of the
sequence $\mathbb{H}_n(X)/n$) will be a necessary condition for the AEP to
hold.

7.3.1 The original AEP 

Shannon (1948) stated the convergence in probability of
$(-\log p_n(X_1^n)/n)$ for ergodic finite processes, and proved it for i.i.d.
sequences and for Markov chains, using the law of large numbers.

McMillan (1953) proved the convergence in mean for stationary ergodic
processes with a finite state space. This constitutes the Shannon-McMillan
theorem. He writes

$$-\log p_n(x_0^{n-1}) = -\sum_{k=0}^{n-1} \log g_{0,k}(\theta^k x),$$

which is the basis of the proofs of most of the different extensions of the
AEP. Here, the reference measure is the counting measure, as for all finite or
countable valued discrete-time processes, and so
$p_n(x_0^{n-1}) = \mathbb{P}(X_0 = x_0, \ldots, X_{n-1} = x_{n-1})$. He uses a
martingale argument to prove the almost sure convergence of
$(-\log g_{0,n}(X))$ to a limit $-\log g(X)$ and derives its convergence in
mean from the finiteness of E. The AEP is then given by the mean ergodic
theorem applied to $-\log g(X)$. Gallager (1968) gave a simpler proof avoiding
martingale arguments.

The almost sure convergence proven by Breiman (1957, 1960) constitutes
the Shannon-McMillan-Breiman theorem, also called ergodic theorem of
information theory or strong AEP. He proves the almost sure convergence of
$(-\log g_{0,n}(X))$ as a nonnegative lower semi-martingale and then uses
Theorem 3. See also Shields (1987) for an alternative proof using a sample
path covering argument.

Fig. 1: The original AEP.

7.3.2 Extensions in terms of state space 

The extension to a countable state space was made by Carleson (1958) for
the convergence in mean (see also Parthasarathy (1964) for a simple proof) and
by Chung (1961) for the almost sure convergence by proving that the uniform
$L^1$ boundedness of the sequence $(-\log g_{0,n}(X))$ still holds in this
case provided that the entropy rate $\mathbb{H}(X)$ is finite.



Perez (1957) made the first extension of the theorem to an arbitrary state
space. He proved that if $\mu$ is the infinite product measure and X is
stationary, then under finiteness of $\mathbb{H}(X)$, convergence in mean of
$(h_n(X)/n)$ holds.

Moy (1960, 1961) extended it to a homogeneous Markovian measure $\mu$,
first under finiteness of $\mathbb{H}_2(X)$, and then under finiteness of
$\mathbb{H}_1(X)$ and upper-boundedness of the sequence
$-\mathbb{E}\log g_{0,n}(X)$ (equivalent to finiteness of $\mathbb{H}(X)$
if the reference measure is a product of independent measures).

Both proofs follow the lines of McMillan's proof, using Doob's martingale
theorem and embedding the process in a bilateral $(X_n)_{n \in \mathbb{Z}}$
process. Let $\pi_n$ denote the n-th coordinate function defined on
$E^{\mathbb{N}}$ by $\pi_n(x) = x_n$ and let $\mathcal{F}_{m,n}$ be the
$\sigma$-field generated by $(\pi_m, \ldots, \pi_n)$ for $m \le n$. The density
of $\mathbb{P}_X$ with respect to the measure $K^{m,n}$ defined on
$\mathcal{F}_{m,n}$ by

$$K^{m,n}(A \times B) = \int_B \mu(A \mid \mathcal{F}_{m,n-1})\, d\mathbb{P}_X, \qquad A \in \mathcal{E},\ B \in \mathcal{F}_{m,n-1},$$

is proven to be $g_{m,n}(x)$. If $\mu$ is Markovian, then $K^{m',0}$ is an
extension of $K^{m,0}$ to $\mathcal{F}_{m',0}$ for all $m' \le m$. And if
$\mu$ is homogeneous, then

$$g_{m,n}(x) = \theta^n g_{m-n,0}(x).$$

Following Gallager's method, Kieffer (1974) gave a simpler proof of the
same result.

Perez (1964) reviewed, applied and generalized the previous extensions of
the AEP in terms of relative entropy between processes.

Fig. 2: Extensions of the AEP in terms of state space; discrete-time processes.

7.3.3 Extensions in terms of measures



Other extensions have been made in the direction of weakening the assump- 
tions of stationarity of the process and of Markovian type of the reference 
measure. 

Let us set $\nu = \mathbb{P}_X$ for simplification. Jacobs (1962) proved that
if $\nu \ll \nu'$ and if the AEP holds for $\nu'$, then it holds for $\nu$
too, for a finite state space. Gray & Kieffer (1980) extended it to the case
where $\nu$ is asymptotically dominated by a stationary measure $\nu'$, in the
sense that

$$\nu'(F) = 0 \implies \nu(\theta^{-n}F) \longrightarrow 0, \qquad n \to +\infty,$$

and the strong AEP holds for $\nu'$. It allows them to use a generalized
version of the ergodic theorem and to prove the AEP for $\nu$ both in mean and
almost surely. Barron (1985) extended it to a Borel state space for the almost
sure convergence, with a reference measure that is Markov of order $m \ge 0$.

Klimko & Sucheston (1968) proved the mean AEP for an irreducible Markov 
chain with a countable state space and an infinite invariant measure, under sev- 
eral additional conditions. 

Wen & Weiguo (1995, 1996) proved the AEP for a non-homogeneous Markov
chain with a finite state space, using the particular form of
$\mathbb{H}_n$ for this case and proving that

$$\frac{1}{n}\Bigl[h_n(X) + \sum_{k=1}^{n} \sum_{j=1}^{|E|} P_k(X_{k-1}, j) \log P_k(X_{k-1}, j)\Bigr] \longrightarrow 0, \qquad n \to +\infty,$$

where $P_k(i,j) = \mathbb{P}(X_k = j \mid X_{k-1} = i)$ are the transition
probabilities of the chain.

Fig. 3: Extensions of the AEP in terms of measures; discrete-time processes.

7.3.4 Extensions to continuous-time processes 

Perez (1957) showed the mean AEP for ergodic stationary processes, in
discrete as well as in continuous time, with a measurable state space and for
the product reference measure, under finiteness of
$\lim_{T \to +\infty} \mathbb{H}_T(X)/T$ (or upper-boundedness of the sequence
$\mathbb{H}_T(X)/T$).

Pinsker (1960) extended it (via a discretization procedure and using
McMillan's proof) to conditions amounting to homogeneity and Markovian
properties of the reference measure for a finite state space.

Kieffer (1974) extended it to a Borel state space, using Gallager's method.

Bad Dumitrescu (1988) showed the mean convergence for a pure jump
Markov process with a finite state space by using Perez (1957) and a
convergence result of Albert (1962) on the number of transitions from one
state to another. She proved the finiteness of
$\lim_{T \to +\infty} \mathbb{H}_T(X)/T$ by writing explicitly the likelihood
of the associated renewal Markov process with respect to the product of the
Lebesgue measure and the counting measure on $E^{\mathbb{N}}$, say $\mu^*$.



Girardin & Limnios (2001b) extended the mean and strong AEP to an
irreducible positive recurrent semi-Markov process with a finite state space
and a semi-Markov kernel absolutely continuous with respect to the Lebesgue
measure on $\mathbb{R}_+$ with derivative $q_{ij}$ such that $(\log q_{ij})$ is
uniformly $L^1(\mathbb{R}_+)$-bounded and with $m_i < +\infty$, where $m_i$
denotes the mean sojourn time in state i, i.e.,

$$m_i = \int_0^{+\infty} \Bigl[1 - \sum_j Q_{ij}(t)\Bigr]\, dt, \qquad i \in E.$$

The proof uses the likelihood of the associated renewal Markov process with
respect to $\mu^*$ via a generalization to these processes of the convergence
result of Albert (1962). Note that any irreducible positive recurrent
semi-Markov process with a finite state space is ergodic. The case of a pure
jump Markov process is derived as a particular case.

The generalization to a countable state space is straightforward under
finiteness conditions.

Under similar hypotheses, the strong and mean AEP for the entropy of a
semi-Markov process relative to another is proven too in Girardin & Limnios
(2001b). The reference measure $\mu$ is then the distribution of a semi-Markov
process too, that is to say a semi-Markovian measure.

Fig. 4: Extensions of the AEP; continuous-time processes.

7.4 Explicit expressions of the entropy rate 

For some kinds of processes, such as Markovian or Gaussian processes, the
entropy rate $\mathbb{H}(X)$ has an explicit form.

It was first defined by Shannon (1948) for an ergodic Markov chain
with a finite state set as the sum of the entropies of the transition
probabilities $(p_{ij})_j$ weighted by the probability of occurrence of each
state according to the stationary distribution, namely

$$\mathbb{H}(X) = -\sum_i \pi_i \sum_j p_{ij} \log p_{ij}, \tag{4.1}$$

and he proved the AEP then. The entropy rate of a positive recurrent chain
with a countable state space takes this form too.
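
Formula (4.1) is easy to evaluate numerically. The sketch below is an
illustration under assumptions not made in the text: a two-state chain with an
arbitrary transition matrix, natural logarithms, and a stationary distribution
obtained by power iteration (which presumes the chain is irreducible and
aperiodic). The function names are hypothetical.

```python
import math

def stationary_distribution(P, iters=1000):
    """Approximate pi with pi P = pi by iterating pi <- pi P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def entropy_rate(P):
    """Entropy rate -sum_i pi_i sum_j p_ij log p_ij of formula (4.1)."""
    pi = stationary_distribution(P)
    n = len(P)
    return -sum(pi[i] * P[i][j] * math.log(P[i][j])
                for i in range(n) for j in range(n) if P[i][j] > 0)

if __name__ == "__main__":
    P = [[0.9, 0.1],
         [0.4, 0.6]]
    print(stationary_distribution(P))   # approximately [0.8, 0.2]
    print(entropy_rate(P))
```

For this matrix the stationary distribution is (0.8, 0.2), so the entropy rate
is the corresponding mixture of the entropies of the two rows.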

Krengel (1967) proved that the entropy rate of a null recurrent chain with a
countable state space is still given by (4.1), if $\pi$ denotes an invariant
measure of the chain (with $\sum_i \pi_i = +\infty$).

For a semi-Markov process, under suitable hypothesis, Girardin & Limnios 
(2001b) showed that 



H(X) = 



Eij^iS(QijlA) 



where $\pi$ denotes the stationary distribution of $(J_n)$.

The relative entropy rate between two semi-Markov processes X and Y is

2^e v t m i 

where X and Y are semi-Markov processes as above, with R denoting the 
semi-Markov kernel of Y. 

The entropy and relative entropy rates of irreducible ergodic finite pure jump
Markov processes X and Y defined in Bad Dumitrescu (1988) are obtained as
special cases of semi-Markov processes, namely

$$\mathbb{H}(X) = -\sum_i \pi_i \sum_{j \ne i} a_{ij} \log a_{ij} + \sum_i \pi_i \sum_{j \ne i} a_{ij}$$

and

$$\mathbb{H}(X \mid Y) = \sum_i \pi_i \sum_{j \ne i} \Bigl(a_{ij} \log \frac{a_{ij}}{b_{ij}} + a_{ij} - b_{ij}\Bigr),$$

where A and B denote the respective infinitesimal generators of X and Y, and
$\pi$ is the stationary distribution of X (i.e., $\pi A = 0$).

An $L^2$ process $X = (X_n)_{n \in \mathbb{N}}$ is weakly stationary if its
covariance function is invariant with respect to shifts of time. Its entropy
at time n is then less than the entropy of the Gaussian process Y with the same
n-dimensional covariance matrix $\Gamma_n$. To be specific,
$\mathbb{H}_n(X) \le \mathbb{H}_n(Y)$, with

$$\mathbb{H}_n(Y) = \frac{1}{2} \log \operatorname{Det}\Gamma_n + \frac{n}{2} \log(2\pi e),$$

see for example Choi (1987), and

$$(\operatorname{Det}\Gamma_n)^{1/n} \longrightarrow \exp \int \log h(\lambda)\, d\lambda = \exp \mathbb{J}(Y),$$
where h is the spectral density of the Gaussian process. Hence for any
Gaussian stationary process Y,

$$\mathbb{H}(Y) = \log \sqrt{2\pi e} + \frac{1}{2}\, \mathbb{J}(Y).$$

The quantity

$$\mathbb{J}(Y) = \int \log h(\lambda)\, d\lambda$$

is the Burg entropy of Y. It is also the limit of another sequence. Indeed, if
$\sigma_N^2$ (resp. $\sigma^2$) denotes the variance of the linear prediction
error of $Y_n$ knowing the finite past $Y_{n-1}, \ldots, Y_{n-N}$ (resp.
infinite), then
$\sigma_N^2 = \operatorname{Det}\Gamma_N / \operatorname{Det}\Gamma_{N-1}$. Due
to the projection properties and to Szegő's theorem (see Grenander & Szegő
(1955)), this yields

$$\sigma_N^2 \longrightarrow \sigma^2 = \exp \mathbb{J}(Y).$$
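
The ratio of consecutive determinants can be examined on a simple example.
The sketch below assumes a stationary Gaussian AR(1) process with coefficient
`phi` and innovation variance `s2` (an example chosen here, not taken from the
text), whose covariance is $\gamma(k) = s2\,\phi^{|k|}/(1-\phi^2)$. For this
process the ratio $\operatorname{Det}\Gamma_N/\operatorname{Det}\Gamma_{N-1}$
equals `s2` for every $N \ge 2$. All function names are hypothetical.

```python
import math

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    d = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            d = -d                        # a row swap flips the sign
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def ar1_covariance(n, phi, s2):
    """n x n covariance matrix of a stationary AR(1) process."""
    g0 = s2 / (1.0 - phi ** 2)
    return [[g0 * phi ** abs(i - j) for j in range(n)] for i in range(n)]

def gaussian_entropy(n, phi, s2):
    """(1/2) log Det Gamma_n + (n/2) log(2 pi e), natural logarithm."""
    return (0.5 * math.log(det(ar1_covariance(n, phi, s2)))
            + 0.5 * n * math.log(2 * math.pi * math.e))

if __name__ == "__main__":
    phi, s2 = 0.6, 2.0
    for N in range(2, 8):
        ratio = det(ar1_covariance(N, phi, s2)) / det(ar1_covariance(N - 1, phi, s2))
        print(N, ratio)
    print(gaussian_entropy(7, phi, s2) - gaussian_entropy(6, phi, s2))
```

The last printed value is the entropy increment per time step, which for this
example equals one half of the logarithm of $2\pi e$ times the innovation
variance.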

Conclusion 

Extensions to group index sets different from $\mathbb{N}$ or $\mathbb{R}_+$
are possible. The AEP is proven to hold for the same time spaces as the
individual ergodic theorem is known to hold when the state space is finite,
see Ornstein & Weiss (1983), through a proof avoiding martingale arguments.

The case of a general Borel state space is still to be studied for semi-Markov
processes. Extension of the AEP to other families of non-stationary processes
could be considered.

The Markov nature of the reference measure seems necessary; different
attempts to get rid of it have failed, see both the statement of Perez (1964)
and the counter-example by Kieffer (1974, 1976), and Perez (1980) commented
by Orey (1985). Orey (1985) extends the strong AEP to "nearly Markovian"
measures, a notion too complicated to be developed here, but which seems to
constitute the limit of extension in this direction.

The real minimal hypothesis on the process and the reference measure for 
the AEP to hold is still an open question. 

References 

Albert, A. Estimating the infinitesimal generator of a continuous time finite state Markov
process. Ann. Math. Stat. V38, pp727-53 (1962).

Algoet, P. H. The strong law of large numbers for sequential decisions under uncertainty. IEEE
Trans. Inform. Theory, V40, pp609-33 (1994).

Algoet, P. H. & Cover, T. M. A sandwich proof of the Shannon-McMillan-Breiman theorem.
Annals Prob., V16, pp899-909 (1988a).

Algoet, P. H. & Cover, T. M. Asymptotic optimality and asymptotic equirepartition properties
of log-optimum investment. Annals Prob., V16, pp876-898 (1988b).

Anderson, W. J. Continuous-Time Markov Chains. Springer-Verlag, New York (1991).

Ash, R. A. Information Theory. Intersciences, New York (1965), republication: Dover,
New York (1994).

Bad Dumitrescu, M. Some informational properties of Markov pure jump processes. Cas.
Pestovani Mat. V113, pp429-34 (1988).

Barron, A. The strong ergodic theorem for densities: generalized Shannon-McMillan-Breiman
theorem. Ann. Probab., V13, pp1292-1303 (1985).

Billingsley, P. Ergodic Theory and Information. R. E. Krieger Publishing Co, Huntington (1978).

Breiman, L. The individual ergodic theorem of information theory. Ann. Math. Stat., V28,
pp809-11 (1957).

Breiman, L. Correction to: the individual ergodic theorem of information theory. Ann. Math.
Stat., V31, pp809-10 (1960).

Carleson, L. Two remarks on the basic theorems of information theory. Math. Scand., V6,
pp175-80 (1958).

Choi, B. S. A proof of Burg's theorem, in Maximum Entropy and Bayesian Spectral
Analysis and Estimation Problems. Eds C.R. Smith & G.J. Erickson, pp75-84 (1987).

Chung, K. L. A note on the ergodic theorem of information theory. Ann. Math. Stat., V32,
pp612-14 (1961).

Cover, T. & Thomas, J. Elements of Information Theory. Wiley series in
telecommunications, New York (1991).

Csizar, I. Maxent, mathematics, and information theory, in Maximum Entropy and
Bayesian Methods. Kluwer Academic Publishers, pp35-50 (1996).

Donsker, M. D. & Varadhan, S. R. Asymptotic evaluation of certain Markov process
expectations for large time I. Comm. Pure Appl. Math., V18, pp1-47 (1975).

Doob, J. L. Stochastic Processes. John Wiley & Sons, New York (1953).

Ellis, Entropy, Large Deviations and Statistical Mechanics. Springer-Verlag,
New York (1985).

Feinstein, A. Foundations of Information Theory. McGraw-Hill, New York (1958).

Gallager, R. G. Information Theory and Reliable Communication. Wiley (1968).

Garret, A. Maximum entropy from the laws of probability, in Bayesian Inference and
Maximum Entropy Methods in Science and Engineering. M. Mohammad-Djafari (Ed.),
AIPCP, pp3-22 (2001).

Girardin, V. Entropy maximization for Markov and semi-Markov processes, submitted (2002).

Girardin, V. & Limnios, N. Probabilités en vue des applications. Vuibert, Paris (2001a).

Girardin, V. & Limnios, N. Entropy of semi-Markov and Markov processes. Prepublication
Paris-Sud Orsay (2001b).

Gray, R. M. & Kieffer, J. C. Asymptotically mean stationary measures. Annals Prob., V8,
pp962-73 (1980).

Grenander & Szegő Toeplitz Forms and their Applications. Chelsea Pub. Co., New
York (1955).

Grendar, M. & Grendar, M. What is the question MaxEnt answers? A probabilistic
interpretation, in Bayesian Inference and Maximum Entropy Methods in Science
and Engineering. M. Mohammad-Djafari (Ed.), AIPCP, pp83-93 (2001).

Guiasu, S. Information Theory with Applications. McGraw-Hill, New York (1977).

Jacobs, K. Lecture Notes on Ergodic Theory. Matematik Institut, Aarhus Univ.,
Denmark, V1 (1962).

Johnson, O. Entropy and limit theorems. PhD Thesis, Cambridge (1999).

Khinchin, A. Mathematical Foundations of Information Theory. Dover, New
York (1957).

Kieffer, J. C. A simple proof of the Moy-Perez generalization of the Shannon-McMillan theorem.
Pacific J. Math. V51, pp203-06 (1974).



Kieffer, J. C. A counterexample to Perez's generalization of the Shannon-McMillan theorem.
Annals Prob., V1, pp362-64 (1973) and V4, pp153-54 (1976).

Krengel, U. Entropy of conservative transformations. Z. Wahrsch. verw. Geb., V7, pp161-81
(1967).

Kullback, S. Information Theory and Statistics. Peter Smith (1978).

Kullback, S. & Leibler, R. A. On information and sufficiency. Ann. Math. Stat., V29, pp79-86
(1951).

Levy, P. Processus semi-markoviens. Proc. Int. Cong. Math. Amsterdam, pp416-26 (1954).

Limnios, N. & Oprişan, G. Semi-Markov Processes and Reliability. Birkhäuser,
Boston (2001).

Linnik, Y. V. An information-theoretic proof of the central limit theorem with the Lindeberg
condition. Theory Prob. Appl., V4, pp288-299 (1959).

McMillan, M. The basic theorems of information theory. Ann. Math. Stat., V24, pp196-219
(1953).

Moran, P. Entropy, Markov processes and Boltzmann's H-theorem. Proc. Camb. Philos. Soc.
V57, pp833-42 (1961).

Moy, S.-T. Asymptotic properties of derivatives of stationary measures. Pacific J. Math., V10,
pp1371-83 (1960).

Moy, S.-T. Generalisations of Shannon-McMillan theorem. Pacific J. Math., V11, pp705-14
(1961).

Orey, S. On the Shannon-Perez-Moy theorem. Contemp. Math. V41, pp319-27 (1985).

Ornstein, D. & Weiss, B. The Shannon-McMillan-Breiman theorem for a class of amenable
groups. Israel J. Math, V44, pp53-60 (1983).

Parthasarathy, K. R. A note on McMillan's theorem for countable alphabets, in Inf. Theory,
Stat. Decision Functions, Random Processes, pp541-543, Prague (1964).

Perez, A. Sur la convergence des incertitudes, entropies et informations echantillon (sample)
vers leurs vraies. Trans. First Prague Conf. Inf. Theory, Stat. Decision Functions, Random
Processes, pp209-243, Prague (1957).

Perez, A. Extensions of Shannon-McMillan's limit theorem to more general stochastic
processes, in Inf. Theory, Stat. Decision Functions, Random Processes, pp545-574 (1964).

Perez, A. On Shannon-McMillan's limit theorem for pairs of stationary random processes.
Kybernetika, V19, pp301-14 (1980).

Pinsker, M. S. Information and Information Stability of Random Variables
and Processes. Moscow (1960), Holden-Day, New York (1964).

Reza, F. An Introduction to Information Theory. McGraw-Hill, New York (1961),
republication: Dover, New York (1994).

Shannon, C. A mathematical theory of communication. Bell Syst. Techn. J., V27, pp379-423,
623-656 (1948).
Shields, P. C. The ergodic and entropy theorems revisited. IEEE Trans. Inf. Theory, V33, pp263- 

66 (1987). 
Smith, W.L. Regenerative stochastic processes. Proc. Roy. Soc. London, Ser. A, V232, pp6-31 

(1955). 
Wen, L. & Weiguo, Y. A limit theoremfor the entropy density of nonhomogeneous Markov 

information source. Stat. Prob. Letters, V22, pp295— 301 (1995). 
Wen, L. & Weiguo, Y. An extension of Shannon-McMillan theorem and some limit properties 
for nonhomogeneous Markov chains. Stoch. Proc. Appl, V61, ppl29^15 (1996). 






DYNAMIC STOCHASTIC MODELS FOR INDEXES 
AND THESAURI, IDENTIFICATION CLOUDS, AND 
INFORMATION RETRIEVAL AND STORAGE 



Michiel Hazewinkel 
CWI

P.O. Box 94079
1090 GB Amsterdam
The Netherlands 
mich@cwi.nl 



Abstract The first topic of this partial survey paper is the growth of adequate lists of
key phrase terms for a given field of science, or of thesauri for such a field. A very
rough 'taking averages' deterministic analysis predicts monotonic growth with
saturation effects. A much more sophisticated and realistic stochastic model confirms
that.

The second, and possibly more important, concept in this paper is that of 
an identification cloud of a keyphrase (or of other things such as formulas or 
classification numbers). Very roughly this is (textual) context information that 
indicates whether a standard keyphrase is present, or, better, should be present, 
whether it is linguistically recognizable or not (or even totally absent). Identi- 
fication clouds capture a certain amount of expert information for a given field. 
Applications include automatic keyphrase assignment and dialogue mediated in- 
formation retrieval (as discussed in this paper). The problem arises how to gen- 
erate (semi-)automatically identification clouds and a corresponding enriched 
weak thesaurus for a given field. A possible (updatable and adaptive) solution is 
described.

8.1 Introduction

The first topic of this paper concerns, among others, the following question.
Suppose one has made an index or thesaurus for a given (super)specialism,
for instance discrete mathematics (understood as combinatorics), on the basis
of a given corpus, like the two (leading?) journals 'Discrete Mathematics' and
'Applied Discrete Mathematics'. How does one tell that the index made is more
or less complete, i.e. more or less good enough to describe the field in question?
And, arising from that, are we really dealing with leading journals (as the
publisher, in this case Elsevier, believes)? As a matter of fact, indexes for the
two journals named have been made [Hazewinkel, 2000; Hazewinkel, 2001], and a
very preliminary analysis [Rudzkis, 2002] indicates that they go some way
towards completeness.

One way to tackle this is to test the collection obtained against another cor- 
pus. However, such a second corpus may not be available. And if it were 
available one would like to use it also for key phrase extraction in order to ob- 
tain an index/thesaurus that is as complete as possible and the same problem 
comes back for the new index/thesaurus based on all material available. 

Another way to try to deal with the question is to watch how the index/thesaurus
grows as more and more material is processed. If, as one would intuitively
expect, saturation phenomena eventually appear, that is a good indicator
that some sort of completeness has been reached. To deal with this not only
qualitatively but also quantitatively, a dynamic stochastic model is needed,
together with appropriate estimators. This is the first topic addressed in this
paper.

The second topic deals with information retrieval and automatic indexing.
These matters seem to have reached a certain plateau. As I have argued at
some length elsewhere, see, e.g., [Marcantognini, 2000; Marcantognini, 2001;
Woerdeman, 1989; Hazewinkel, 1999b], there is only so much that can be done
with linguistic and statistical means only. To go beyond that, it could be necessary
to build some expert knowledge into search engines and the like. This has
led to the idea of identification clouds, which is one of the topics of this paper.

The same idea grew out of a rather different (though related) concern. It is
known and widely acknowledged that a thesaurus for a given field of inquiry
is a very valuable thing to have. However, a classical thesaurus according
to ISO standard 2788, see [Arocena, 1990], and various national and international
multilingual standards, is not an easily incrementally updatable structure.
Indeed, keeping up to date the well known thesaurus EMBASE, [Burg,
1975; Castro, 1986], which is at the basis of Excerpta Medica, takes the
full-time efforts of four people. This problem of semi-automatic incremental
updating of a thesaurus has led to the idea of an enriched weak thesaurus,
[Marcantognini, 2001; Hazewinkel, 1999b], and identification clouds are a central
part of that kind of structure.

In the second part of this paper I try to give some idea of what ID clouds 
are and how they can be used. More applications can be found in the papers 
quoted. The idea has meanwhile evolved, largely because of the use of ID 
clouds in the EC project TRIAL SOLUTION, [Dahn, 1999], and in this pa- 
per I also sketch the refinements that have emerged, and indicate some open 
problems that need to be solved if this approach is to be really useful. 

This paper is an outgrowth of the lecture I gave on (some of) these matters 
at the IWAP 2002 meeting in Caracas, Venezuela, January 2002. I thank the
organizers of that meeting for that opportunity. 

8.2 A First Preliminary Model for the Growth of Indexes 

The problem considered in this section is how a global index, a list of terms
supposed to describe a given field of enquiry, evolves as indexing proceeds and,
simultaneously, the field develops (at a far from trivial pace). The question
arises how such an index evolves chronologically (assuming, for simplicity,
that the indexing is also done chronologically) and, most importantly, how
one judges on the basis of these data whether the index generated is adequate
for the field in question or not.

Here is a very simple (and naive) stochastic model for this situation and a
preliminary (deterministic) analysis of it. At starting time (time zero) there is
an (unknown) collection, K(0), of key phrases that is adequate for the field in
question. In addition there is an infinite universe of potential new (important)
key phrases that can be dreamed up by authors and others. Thus, from
the point of view of indexing and thesauri the field grows as:

$$K(t+1) = K(t) \cup B(t),$$

where the union is disjoint and B(t) is the collection of new terms generated 
in period t. These are not yet known (i.e. identified/recognized), but they do 
exist in one form or another in the corpus as it exists at time t. 

Now let indexing start. At time zero no terms have been identified. Let X(t)
stand for the set of terms recognized (found) at time t, $X(t) \subset K(t)$. Hence
$X(0) = \emptyset$. A generalization would be that one starts with an existing thesaurus
and tries to bring it up to date; then X(0) is a known subset of K(0).

The indexing proceeds as follows. At time t a set of terms S(t) is selected
(found, recognized) and added to X(t). This set S(t) consists of two parts,
$S(t) = A(t) \cup C(t)$, $A(t) \subset K(t)$, $C(t) \subset B(t)$, $A(t) \cap C(t) = \emptyset$. Thus

$$X(t+1) = X(t) \cup S(t) \subset K(t+1).$$

As a rule, of course, part of A(t) is already in X(t). The main problem is to
have criteria or estimates to decide whether eventually X(t) exhausts K(t), or
$K(t-\tau)$ for a suitable delay $\tau$, or not. For instance in the form

$$y(t) = \frac{x(t)}{k(t)} \to 1 \quad \text{as } t \to \infty,$$

where x(t) is the cardinality of X(t) and similarly for k(t). The (only) basic
observable is S(t) and, deriving from that, X(t).

Let us do some rather crude average reasoning. First, let us assume linear
growth of the field of science in question:

$$k(t) = k(0) + tv$$

for some constant v. Also, on average u terms are selected (per period), with a
fraction $x(t)/k(t)$ coming from known stuff and a fraction $(k(t) - x(t))/k(t)$ of
new terms. There results a recursion equation for x(t):

$$x(t+1) = x(t) + u\left(1 - \frac{x(t)}{k(t)}\right).$$

Let y(t) = x(t)/k(t) be the fraction of terms covered by the thesaurus at this
time. Then, using k(t) = k(t + 1) - v,

$$y(t+1) - y(t) = \frac{u}{k(t+1)} - \frac{(u+v)\,y(t)}{k(t+1)}.$$

Assume that the differential equation

$$y' = \frac{u}{k(t+1)} - \frac{(u+v)\,y(t)}{k(t+1)}$$

approximates the difference equation above well enough (which is certainly the
case). This differential equation is actually explicitly solvable and the solution
is:

$$y(t) = \frac{u}{u+v} - \frac{u\,(k+v)^{1+(u/v)}}{(u+v)\,\bigl(k+(t+1)v\bigr)^{1+(u/v)}},$$

where k = k(0). So

$$\lim_{t\to\infty} y(t) = \frac{u}{u+v} \tag{2.1}$$

and y(t) grows monotonically from 0 to the asymptotic limit value u/(u + v).
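
The recursion is easy to check numerically. The following minimal Python sketch iterates x(t+1) = x(t) + u(1 - x(t)/k(t)) with k(t) = k(0) + tv and compares the resulting fraction y(t) = x(t)/k(t) with the limit u/(u + v) of (2.1). The function name and the parameter values are illustrative assumptions only; they do not come from an actual index.

```python
def coverage_fraction(k0, u, v, steps):
    """Iterate x(t+1) = x(t) + u*(1 - x(t)/k(t)) with k(t) = k0 + t*v.

    Returns the list of fractions y(t) = x(t)/k(t) for t = 0, ..., steps.
    """
    x = 0.0                        # X(0) is empty: nothing recognized yet
    fractions = []
    for t in range(steps + 1):
        k = k0 + t * v             # linear growth of the field
        fractions.append(x / k)
        x += u * (1.0 - x / k)     # u selections, of which a share x/k is already known
    return fractions


# Illustrative (assumed) values: 1000 initial terms, 50 selections and
# 10 newly coined terms per period.
k0, u, v = 1000.0, 50.0, 10.0
ys = coverage_fraction(k0, u, v, steps=2000)
print("y(2000) =", round(ys[-1], 6), " u/(u+v) =", round(u / (u + v), 6))
```

With these values the computed fraction rises monotonically and settles at u/(u + v), well below one.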

In particular the recognized fraction of relevant (latent) index terms does not
approach one as long as the field keeps growing, and it grows slowly (compared
to the indexing rate) once one gets very close to the asymptotic limit. Note
also that the saturation phenomenon alluded to in the introduction does indeed
occur.

Of course this is quite primitive. Frequently, replacing stochastic phenomena
with averages (in a nonlinear case) does not work. So a more sophisticated
analysis of this kind of stochastic process (apparently a new kind) is needed.
This is described in the next section.

8.3 A Dynamic Stochastic Model for the Growth of 
Indexes 

Using the same notation as above, the basic assumptions of the model are
as follows.

• The x(t), the cardinalities of the sets of key phrases identified up to and
including time t, form a random Poisson process. That is, the increments
$\Delta x(t) = x(t) - x(t-1)$ are independent random variables with a
Poisson distribution $P_{\lambda_t}$. For simplicity x(0) is assumed to be a deterministic
quantity. Let $n(t) = \mathbf{E}\,x(t)$; then $\lambda_t = \Delta n(t)$.

• The key phrases are numbered consecutively as they appear in time.
A key phrase $w_k \in K(t)$ at the time of its emergence has attached to
it a random weight $W_k$ that reflects its relevance (= importance) at that
time. The $W_k$ are supposed to be i.i.d. positive random variables with a
distribution function F independent of the sequence x(t), and $\mathbf{E}\,W_k = 1$.

• As before let S(t) be the set of key phrases that were observed at time
t and let $A_{k,t} = \{w_k \in S(t)\}$. The probabilities of the random events
$A_{k,t}$ depend on the random weights $W_k$ and the history so far, $I_t$, of the
system considered. Assume that for fixed $W_k$ and $I_t$, the events $A_{k,t}$,
$k = 1, \ldots, x(t)$ are conditionally independent and that the following
equalities hold

$$P\{A_{k,t} \mid W_{(\cdot)}\} = \min\{\,\ldots\,\} \overset{\mathrm{def}}{=} \pi_{k,t}. \tag{3.1}$$

Here $u_t = \mathbf{E}\,|S(t)|$ is a deterministic function that reflects the importance
of the corpus used. Assumption (3.1) is quite a weak one, practically
dictated by the way indexes and thesauri grow in practice.

The results to be quoted below are some of the ones in [Hazewinkel &
Rudzkis, 2001] and concentrate on the case that $W_k = 1$. Obviously, much
more general models should be examined. For one thing the importance of a
key phrase is certainly not a constant and, moreover, is likely to change in
time.

Set 

h{t) = -Ex®, a = E^- y 
then, besides other asymptotic results, assuming $\lambda_t = \lambda$, $u_t = u$,

x(t) a 



lim E 

t—KX> 



k(t) A 







which in the case that $W_k = 1$ is precisely the result (2.1) of the crude "taking
averages" analysis of Section 2 above. It remains to be sorted out what happens
in more general circumstances.
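
For the case of constant weights the agreement with (2.1) can also be illustrated by simulation. The Python sketch below is only one possible reading of the model, under assumptions that are not taken from loc. cit.: per period the field grows by a Poisson number of new key phrases with mean v, and every existing key phrase is observed independently with probability min(u/k, 1), where k is the current number of existing key phrases. All names and parameter values are hypothetical.

```python
import math
import random


def poisson(rng, mean):
    """Sample a Poisson variate by Knuth's multiplication method (small means only)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_index_growth(k0, u, v, steps, seed=0):
    """Simulate index growth with constant weights (W_k = 1).

    Assumed dynamics: in each period every existing key phrase is observed
    independently with probability min(u / k, 1); afterwards Poisson(v) new
    key phrases emerge. Returns the fraction of existing key phrases found.
    """
    rng = random.Random(seed)
    k = k0        # number of key phrases that exist (latent or already found)
    found = 0     # number of key phrases recognized so far
    for _ in range(steps):
        p = min(u / k, 1.0)
        unfound = k - found
        # only observations of not-yet-found phrases enlarge the index
        found += sum(1 for _ in range(unfound) if rng.random() < p)
        k += poisson(rng, v)
    return found / k


fraction = simulate_index_growth(k0=200, u=20.0, v=5.0, steps=400, seed=1)
print("simulated fraction:", round(fraction, 3), " u/(u+v):", 20.0 / 25.0)
```

Under these assumptions the simulated fraction fluctuates around u/(u + v); other weight distributions F would need a different observation probability.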
There is also an exhaustion result:

$$\lim_{t\to\infty} P\{K(0) \subset S(t)\} = 1 \iff \sum_{t} \frac{u_t}{n(t)} = \infty,$$

which means that if the observation rate is not too small compared to the 
growth rate of the field then, eventually, the (latent) key phrases at time zero 
will all be found. 

Shifting time this means that for any time t a certain amount of time later 
all potential key phrases K(t) will have been recognized with probability 1. 
What is still needed is an estimate of how much time that will take (depending 
of course on growth and observation rates). 

For a number of statistical estimators of the parameters of the model see loc. 
cit. 

8.4 Identification Clouds 

Now suppose that we have a near perfect list of key phrases for, say, math- 
ematics. That is not the case, but adequate lists do exist for certain subfields, 
[Kailath, 1986; Sz-Nagy, 1970; Schur, 1986; Hazewinkel, 2000; Hazewinkel, 
2001; Hazewinkel, 2001a; Hazewinkel, 2002].

Even then there remain most serious open problems of information storage
and retrieval. To start, let us look at an example. Here is a phrase that occurred
in an abstract that came my way for indexing purposes some 6 years ago:



"... using the Darboux process the complete structure of the solutions of the 
equation can be obtained," 



At first sight, speaking linguistically, it looks like there is here a perfect natural
key phrase to be assigned, viz. "Darboux process"; presumably some sort
of stochastic process like "Cox process", "Galton-Watson process", "Dirichlet
process", or "Poisson process".

However, there is no concept, or result, or anything else in mathematics
that goes by the name "Darboux process". Also the context did not look like
having anything to do with stochastics and/or statistics. Had the abstract been
classified (it wasn't) using the MSCS (Mathematics Subject Classification
Scheme) it would have carried a number like 58F07 (1991 version) or 37J35
(2000 version), neither of which has anything to do with stochastics.

The proper name "Darboux" is also not sufficient to identify what is meant; 
there are too many terms with "Darboux" in them: "Darboux surface", "Dar- 
boux Baire 1 function", "Darboux property", "Darboux function", "Darboux 
transformation", "Darboux theorem", "Darboux equation",... (these all come 
from the indexes of [Landau, 1987]). 

Or take the following example from [Smeaton, 1992]. Suppose a querier 
is interested in "prenatal ultrasonic diagnosis". Then texts containing phrases 
like "in utero sonographic diagnosis", "sonographic detection of fetal ureteral 
obstruction", "obstetric ultrasound", "ultrasonics in pregnancy", "midwife's
experience with ultrasound screening" should also be picked up. Or, inversely, 
when assigning key-phrase metadata to documents, the documents containing 
these phrases should also receive the standard controlled key phrase "prenatal 
ultrasonic diagnosis". 

One way to handle such problems (and a number of other problems, see 
below) is by means of the idea of identification clouds. 

Basically the "identification cloud" of an item from a controlled list of stan- 
dardized key phrases is a list of words and possibly other (very short) phrases 
that are more or less likely to be found near that key phrase in a scientific text 
treating of the topic described by the key phrase under consideration. 

For instance the key phrase 

Darboux transformation

could have as (part of its) identification cloud the list

soliton 

dressing transformation 
Liouville integrable 
completely integrable 
Hamiltonian system 
inverse spectral transform 
Backlund transformation 
KdV equation 
KP equation 
Toda lattice 
conservation law 

inverse spectral method 
exactly solvable 

(37J35, 37K (the two MSC2000 classification codes for this area of 
mathematics)) 

And in fact this particular identification cloud solves the "Darboux process" 
problem above. The surrounding text contained such words as 'soliton', 'com- 
pletely integrable', and others from the list above. The appropriate index 
phrase to be attached was "Darboux transformation". 

What the authors of the abstract meant was something like "repeated use of 
the process 'apply a Darboux transformation' will give all solutions". 

A human mathematician, more or less expert in the area of completely in- 
tegrable systems of differential equations, would have no difficulty in recog- 
nizing the phrase "Darboux process" in this sense. Thus what identification 
clouds do is to add some human expertise to the thesaurus (list of key phrases) 
used by an automatic system. 

The idea of an identification cloud is part of the concept of an enriched weak 
thesaurus as defined and discussed in [Marcantognini, 2001; Rudin, 1979; Ha- 
zewinkel, 1999b]. 

8.5 Application 1: Automatic Key Phrase Assignment 

A first application of the idea of identification clouds is the automatic
assignment of key phrases to scientific documents or suitable chunks of scientific
texts.

It is simply a fact that it often happens that in an abstract or chunk of text a
perfectly good key phrase for the matter being discussed is simply not present
or so well hidden that linguistic and/or statistical techniques do not suffice to
recognize it automatically.

The idea here is simple. If enough of the identification cloud of a term
(= standard keyphrase) is present, then that key phrase is a good candidate at
least for being assigned to the document under consideration.
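
The rule can be sketched in a few lines of Python. This is a minimal illustration and not the program used in practice: the function names, the cut-off `min_fraction` and the two small clouds are hypothetical, and "enough of the cloud" is read here simply as the fraction of cloud items that occur in the text.

```python
import re


def cloud_hits(text, cloud):
    """Return the cloud items that occur in `text` as whole words or phrases."""
    lowered = text.lower()
    return [item for item in cloud
            if re.search(r"\b" + re.escape(item.lower()) + r"\b", lowered)]


def candidate_key_phrases(text, clouds, min_fraction=0.3):
    """Propose every key phrase whose cloud is sufficiently present in `text`.

    `clouds` maps a standard key phrase to its identification cloud (a list of
    words or short phrases). `min_fraction` is an assumed cut-off.
    """
    proposals = []
    for phrase, cloud in clouds.items():
        hits = cloud_hits(text, cloud)
        if cloud and len(hits) / len(cloud) >= min_fraction:
            proposals.append((phrase, hits))
    return proposals


# Hypothetical miniature clouds, loosely modelled on the example above.
clouds = {
    "Darboux transformation": ["soliton", "completely integrable", "KdV equation",
                               "Toda lattice", "dressing transformation"],
    "Poisson process": ["intensity", "arrival times", "queue", "stochastic"],
}
text = ("Using the Darboux process, soliton solutions of this "
        "completely integrable equation are obtained.")
print(candidate_key_phrases(text, clouds))
```

Here the key phrase "Darboux transformation" is proposed although it does not occur in the text, because two of its five cloud items do.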

Here are two examples. 

8.5.1. Example 

Two-dimensional iterative arrays: characterizations and applications. 

We analyse some properties of two-dimensional iterative and cellular ar- 
rays. For example, we show that arrays operating in $T(n)$ time can be sped 
up to operate in time $n+(T(n)-n)/k$. 

computation. Unlike previous approaches, we carry out our analyses using se- 
quential machine characterizations of the iterative and cellular arrays. Con- 
sequently, we are able to prove our results on the much simpler sequential 
machine models. 

iterative array 

sequential characterization of cellular arrays 
sequential characterization of iterative arrays 
characterization of cellular arrays 
characterization of iterative arrays 

arrays of processors



Here the available data consisted of an abstract (which is only partially repro- 
duced here). In bold, in the abstract itself, are indicated the index (thesaurus) 
phrases which can be picked-out directly from the text. Below the original text 
are five more phrases, that can be obtained from the available data by relatively 
simple linguistic means, assuming that one has an adequate list of standard key 
phrases available. For instance "sequential characterization of cellular arrays" 
and "sequential characterization of iterative arrays" result from the phrase in 
italics in the abstract fragment above. Note that instead of doing (more or less 
complicated) linguistic transformations, these could also have been obtained 
by means of identification clouds. There are advantages in this because there 
are so very many possible linguistic transformations. 

Then, in shadow, there is the term "array of processors". This one is more 
complicated to find. But, given an adequate standard list, and with "array", 
"processors" and "machine" all in the available text, it is recognizable, using 
identification clouds, as a term that belongs to this document.

Finally, in bold-shadow, there is the key phrase "speed-up theorem", a well
known type of result in complexity theory. In the text there just occurs "sped
up". Certainly, unless one has a good list of (standard) key phrases available,
this would be missed. Also purely linguistic means plus such a very good list
are clearly still not sufficient; there is no way that one can have a key phrase
extraction rule like 'if "sped up" occurs "speed-up theorem" is a likely key
phrase'. However, "sped up" plus supporting evidence from the context in the
form of a sufficient number of terms from the identification cloud of "speed-up
theorem" being present, would do the job.

8.5.2. Example 

Sequential and concurrent behaviour in Petri net theory. 

Two ways of describing the behaviour of concurrent systems have widely 
been suggested: arbitrary interleaving and partial orders. Sometimes the 
latter has been claimed superior because concurrency is represented in a 'true' 
way; on the other hand, some authors have claimed that the former is sufficient 
for all practical purposes. Petri net theory offers a framework in which both 
kinds of semantics can be defined formally and hence compared with each 
other. Occurrence sequences correspond to interleaved behaviour while the 
notion of a process is used to capture partial-order semantics. This paper 
aims at obtaining formal results about the 

more powerful than inductive semantics using 

of nets which are of finite synchronization and 1-safe. 

sequential behaviour in Petri net theory 

Petri net theory 

axiomatic definition of processes 



1-safe nets

The style coding is the same as in the previous example. Here, the constituents 
"1-safe" and "nets" of "1-safe nets" actually occur in the text. But they are so 
far apart that without standard lists and identification clouds the phrase would 
probably not be picked up. The same holds for the key phrase "interleaving 
semantics". 

Afterwards, I checked against the full text whether these extra key phrases 
were indeed appropriate. They were. Two more examples can be found in [Wo- 
erdeman, 1989] or [Hazewinkel, 1999b]. These are all actual examples which 
occurred in the corpora used to produce the indexes [Sz-Nagy, 1970; Schur, 
1986].

A C-program that takes as input a keyphrase list with identification clouds
and a suitably prepared corpus of documents (chunks of text or abstracts) and
that gives as output the same corpus with each item enriched with automatically
assigned keyphrases has been written in the context of the EC project
"TRIAL SOLUTION" (Febr. 2000-Febr. 2003), [Dahn, 1999]. It also outputs
an html file for human use which can be used to check how well the program
worked. This validation test is currently (2002) under way.

It is already clear that the idea of identification clouds needs refinements,
certainly when used on rather elementary material (as in TRIAL SOLUTION).
Two of these will be briefly touched on below.

8.6 Application 2: Dialogue Mediated Information 
Retrieval 

Given a keyphrase list with identification clouds, or, better, an enriched 
weak thesaurus, it is possible to use a dialogue with the machine to refine 
and sharpen queries. Here is an example of how part of such a dialogue could 
look: 

(Query:) I am interested in spectral analysis of transformations? 
(Answer:) I have: 

• spectral decompositions of operators in Hilbert space (in domain 47,
operator theory, 201 hits)

• spectral analysis (in domain 46, functional analysis, 26 hits)

• spectrum of a map (in domain 28, measure theory, 62 hits)

• spectral transform (in domain 58, global analysis, 42 hits)

• inverse spectral transform (in domain 58, global analysis,
405 hits)

Please indicate which are of interest to you by selecting up to five 
of the above and indicating, if desired, other additional words or 
key phrases. 

The way this works is that the machine scans the query against the available 
identification clouds (using some (approximate) string matching algorithm, 
e.g., Boyer-Moore) and returns those keyphrases whose ID clouds match best, 
together with some additional information to help the querier make up his 
mind. 
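
A minimal Python sketch of the matching step follows. It is a simplification under stated assumptions: exact word overlap stands in for the (approximate) string matching mentioned above, the stop list and the clouds are hypothetical, and the domain and hit-count information shown in the dialogue is left out.

```python
import re

# Assumed stop list; a real system would use a proper one.
STOP_WORDS = {"i", "am", "interested", "in", "of", "a", "an", "the", "and", "to"}


def words(text):
    """Lower-cased content words of `text`, ignoring the small stop list."""
    return {w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS}


def rank_key_phrases(query, clouds, top=5):
    """Return up to `top` key phrases, ordered by word overlap with the query.

    A key phrase is scored by the number of query words that occur in the
    phrase itself or in its identification cloud.
    """
    query_words = words(query)
    scored = []
    for phrase, cloud in clouds.items():
        vocabulary = words(phrase) | words(" ".join(cloud))
        overlap = len(query_words & vocabulary)
        if overlap > 0:
            scored.append((-overlap, phrase))   # negative: best match sorts first
    return [phrase for _, phrase in sorted(scored)[:top]]


# Hypothetical clouds, for illustration only.
clouds = {
    "spectral analysis": ["functional analysis", "Banach algebra"],
    "inverse spectral transform": ["soliton", "KdV equation", "scattering data"],
    "Galton-Watson process": ["branching", "extinction probability"],
}
print(rank_key_phrases("I am interested in spectral analysis of transformations", clouds))
```

Exact matching misses the link between "transformations" and "transform"; that is precisely where approximate matching or stemming would be needed.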

8.7 Application 3: Distances in Information Spaces

As it is, the collection of standard keyphrases is just a set. It is a good idea to 
have a notion of distance on this set: are two selected standard key phrases near, 
i.e. closely related, or are they quite far from each other. Identification clouds 
provide one way to get at this idea: two phrases which have large overlap in 
their identification clouds are near to each other. 
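
One simple way to turn "large overlap" into a number is the Jaccard distance between the two clouds, i.e. one minus the size of their intersection divided by the size of their union. This choice is an assumption made for the sake of illustration, as are the function names and the sample clouds in the following Python sketch.

```python
def cloud_distance(cloud_a, cloud_b):
    """Jaccard distance between two clouds: 0 for identical, 1 for disjoint clouds."""
    a = {item.lower() for item in cloud_a}
    b = {item.lower() for item in cloud_b}
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def neighbours(phrase, clouds, radius):
    """All other key phrases within distance `radius` of `phrase`, nearest first."""
    base = clouds[phrase]
    near = [(cloud_distance(base, cloud), other)
            for other, cloud in clouds.items() if other != phrase]
    return [(other, d) for d, other in sorted(near) if d <= radius]


# Hypothetical clouds, for illustration only.
clouds = {
    "Darboux transformation": ["soliton", "KdV equation", "Toda lattice",
                               "completely integrable"],
    "Backlund transformation": ["soliton", "KdV equation", "sine-Gordon equation",
                                "completely integrable"],
    "Poisson process": ["intensity", "arrival times"],
}
print(neighbours("Darboux transformation", clouds, radius=0.5))
```

The two transformation phrases share three of five distinct cloud items, so they lie at distance 0.4 from each other, while "Poisson process" is at distance 1 from both.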

A use of this, again dialogue mediated, is as follows. 

(Query:) I am interested in something related to <StandardKeyPhrase 
1>. Please give me all standard keyphrases that are within dis- 
tance x of this one. 

For other ways to define distances on information spaces (such as the space of 
standard key phrases) and other potential uses of distance, see [Hazewinkel, 
1999b]. 

A distance on the space of key phrases is related to a distance on the space 
of documents, see loc. cit. This is also most useful in dialogue mediated 
querying. Suppose a really good document for a given query has been found. 
Then a very useful option is

(Query:) I am interested in documents close to <Document 1>. Please 
give me all standard documents that are within distance x of this 
one and which have two or more of the following key phrases in 
their key phrase metadata field. 

Some search engines have a facility like this, in the form of a button like 'similar
results' in SCIRUS of Elsevier, but it is not based on distances in information
spaces.

8.8 Application 4: Disambiguation 

Ambiguous terms are a perennial problem in (automatic) indexing and the- 
saurus building. 

Identification clouds can serve to distinguish linguistically identical terms 
from very different areas of the field of inquiry in question. E.g., "regular ring" 
in mathematics, or the technical term "net" which has at least five completely 
different meanings in various parts of mathematics and theoretical computer 
science. For instance 'transportation net' in optimization and operations re- 
search, 'net of lines' in differential geometry, 'net' in topology (which replaces 
the concept of a sequence in topological spaces where the notion of sequence 
is not good enough), 'communication net', 'net(work) of automata',.... 

Identification clouds also serve to distinguish rather different instances of
the same basic idea in different specializations. E.g., spectrum of a commutative
algebra in mathematics, spectrum of an operator in a different part of
mathematics, and spectrum (of a substance) in physics or chemistry are distantly
related and ultimately based on the same idea but are in practice completely
different terms.

Possibly an even worse problem is caused by phrases and words which have 
very specific technical meanings but also occur in scientific texts in everyday 
language meanings. A nice example is the technical concept "end" as it occurs 
in group theory, topology and complex function theory (three technically dif- 
ferent though related concepts). Searching for "end" in a large database such 
as MATH of FIZ/STN (Berlin, Karlsruhe) is completely hopeless. Searching 
for "end" together with its ID cloud for its technical meaning in group theory 
would be a completely different matter. Note that specifying group theory as 
well in the query would not help much; there are simply too many ways in 
which the word 'end' occurs (end of a section, to this end, end of the argu- 
ment, end of proof, ...). There are many more words like this; also phrases. 
For instance 'sort' (as in many sorted languages or sorting theory) and 'bar' 
(as in bar construction). For more about the 'story of ends', see [Woerdeman, 
1989]. 

8.9 Application 5: Slicing Texts

One important thing made possible by modern electronic technology, i.e.
computers and the internet, is the systematic reuse of (educational) material
and the composing of books and documents exactly tailored to the needs of an
individual user. For instance a teacher may like the introduction to the idea of
a topological space from book1, consider the formal definition of book2 better
and may want to use some examples from book3, some exercises from book4,
and some historical comments from book5.

The question arises how to chop up a longer text into chunks (slices) that
can be efficiently recombined to form such individually tailored texts. This is
the subject of the EC Framework 5 project TRIAL SOLUTION (Febr. 2000-Febr.
2003), [Dahn, 1999]. If the document to be sliced is well structured, for
instance composed using LaTeX2e, the structure imposed by the author is a
good guide where to slice and this is what TRIAL has so far concentrated on.

Now suppose we have a long section (slices should be relatively short; cer- 
tainly not more than one computer screen) or an unstructured text, i.e. no clear 
markings indicating sections, subsections, etc., the exact opposite of a good La- 
TeX2e document. Suppose also that key phrases have been found and marked 
in the text and that for each key phrase the evidence for including that key 
phrase has also been marked; i.e. for each key phrase the corresponding items 
from its identification cloud have been marked. Treating the text as a long 
linear string we get a picture like the following. 

The numbered fat hollow circles are key phrases in the text which is depicted 
as a fat horizontal line running over four lines; the arrows connect a key phrase 
to a member of its identification cloud. If the key phrase is not actually present, 
the fat circle is the centre of mass of the terms indicating its virtual presence. 
An arrow can run over more than one line; then labels are used to indicate how 
it continues. 

It is now natural to cut the text at those spots where the number of arrow 
lines is smallest. For instance, at the three points indicated by fat vertical lines. 
This can be done at several levels to get a hierarchical slicing. To be able to 
do this optimally one needs a good stochastic model for the distribution of key 
phrases through a text and also for the distribution of identification cloud items 
for a key phrase. 
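
As an illustration only, the cutting rule can be sketched as follows. The sketch assumes (this is not spelled out in the text) that the text is a sequence of token positions and that each arrow is a pair of positions; the names `crossing_counts` and `cut_points` and the greedy choice of cuts are hypothetical, not part of TRIAL SOLUTION.

```python
def crossing_counts(length, arrows):
    """For each gap between token g and token g+1, count the arrows spanning it.

    `arrows` is a list of (key_phrase_pos, cloud_item_pos) pairs with
    0 <= pos < length. The result has length - 1 entries, one per gap.
    """
    diff = [0] * (length + 1)
    for a, b in arrows:
        lo, hi = min(a, b), max(a, b)
        diff[lo] += 1   # the arrow starts crossing at gap lo ...
        diff[hi] -= 1   # ... and no longer crosses from gap hi onwards
    counts, running = [], 0
    for gap in range(length - 1):
        running += diff[gap]
        counts.append(running)
    return counts


def cut_points(length, arrows, n_cuts, min_slice=1):
    """Greedily choose up to n_cuts gaps crossed by the fewest arrows."""
    counts = crossing_counts(length, arrows)
    order = sorted(range(len(counts)), key=lambda g: (counts[g], g))
    chosen = []
    for gap in order:
        # Keep every slice at least min_slice tokens long.
        bounds = [-1] + chosen + [length - 1]
        if all(abs(gap - b) >= min_slice for b in bounds):
            chosen.append(gap)
        if len(chosen) == n_cuts:
            break
    return sorted(chosen)


arrows = [(0, 3), (1, 2), (5, 8), (6, 9), (4, 5)]
print(crossing_counts(10, arrows))
print(cut_points(10, arrows, n_cuts=1))
```

Applying `cut_points` again inside each resulting slice would give a hierarchical slicing. A stochastic model of how arrows are distributed, as called for above, would replace the plain minimum by a statistically justified criterion.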

The problem of slicing a text into suitable chunks also comes up in other 
contexts. For instance in the matter of automatic generation of indexes and 
identification clouds, see Section 17 below, and in the topic of text mining, 
see [Visa, 2001], p. 7. 

8.10 Weights 

One thing that emerged out of the use of identification clouds in the project 
TRIAL SOLUTION was that it is wise to give weights (numbers between 0 
and 1 adding up to 1) to the elements making up an identification cloud. 

Here is an example: 

<KEYPHRASE NAME="Burgers-Gleichung" THRESHOLD="0.67"> 
<WORD VALUE="Burgers-Gleichung" WEIGHT="0.7"> 
<WORD VALUE="Burgers" WEIGHT="0.4"> 
<WORD VALUE="Gleichung" WEIGHT="0.2"> 
<WORD VALUE="Boussinesq" WEIGHT="0.025"> 
<WORD VALUE="nichtlinear" WEIGHT="0.025"> 
<WORD VALUE="Evolutionsgleichung" WEIGHT="0.025"> 
<WORD VALUE="Solitonlosung" WEIGHT="0.025"> 
<WORD VALUE="Transformation" WEIGHT="0.025"> 
<WORD VALUE="KdV" WEIGHT="0.025"> 
<WORD VALUE="sinh" WEIGHT="0.025"> 
<WORD VALUE="Gordon" WEIGHT="0.025"> 
<WORD VALUE="Hirota" WEIGHT="0.025"> 
<WORD VALUE="Kadomzev" WEIGHT="0.025"> 
<WORD VALUE="Pedviashwili" WEIGHT="0.025"> 
<WORD VALUE="Soliton" WEIGHT="0.025"> 
<WORD VALUE="Backlund" WEIGHT="0.025"> 
<WORD VALUE="inverse spektral" WEIGHT="0.025"> 
<WORD VALUE="HOPF" WEIGHT="0.025"> 
<WORD VALUE="COLE" WEIGHT="0.025"> 
</KEYPHRASE> 

This particular identification cloud is designed to find occurrences of the 
Burgers equation as it occurs in the area of completely integrable dynamical 
systems (soliton equations, Liouville integrable systems). There are other areas 
where it occurs; a matter which is further discussed in Section 18 below. 

Of course if the phrase itself occurs that is enough, as reflected by the first 
item in the 'WORD VALUE' list. Note further that the occurrence of "Burgers" 
and of "equation" is not quite enough. There is a good reason for that. For 
one thing there is also a concept called "Burgers vector" (in connection with 
torsion in differential geometry); also "Burgers" is a fairly common surname. 
Further "equation" is of such frequent occurrence (in mathematics) that it can 
turn up just about anywhere. Thus the occurrence of both "Burgers" and 
"equation" in a chunk of text is not enough to decide that "Burgers equation" 
is a suitable key phrase for that chunk. But if three or more of the sort of 
words that belong to completely integrable dynamical systems are also present 
one can be quite sure that it is indeed a suitable key phrase. 
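
The arithmetic behind these remarks can be checked with a small sketch. It assumes, as the example suggests but the text does not state formally, that the weights of all cloud items found in a chunk are simply added and compared with the threshold; the cloud is abridged, and the names `cloud_score` and `is_assigned` are hypothetical.

```python
# Abridged copy of the Burgers-Gleichung cloud from the example above.
BURGERS_CLOUD = {
    "threshold": 0.67,
    "weights": {
        "Burgers-Gleichung": 0.7,
        "Burgers": 0.4,
        "Gleichung": 0.2,
        "Boussinesq": 0.025,
        "nichtlinear": 0.025,
        "Soliton": 0.025,
        "KdV": 0.025,
        "Hirota": 0.025,
    },
}


def cloud_score(cloud, present_terms):
    """Sum the weights of the cloud items that were found in the chunk."""
    return sum(w for term, w in cloud["weights"].items() if term in present_terms)


def is_assigned(cloud, present_terms):
    """Assign the key phrase when the summed evidence reaches the threshold."""
    return cloud_score(cloud, present_terms) >= cloud["threshold"]


print(is_assigned(BURGERS_CLOUD, {"Burgers", "Gleichung"}))
print(is_assigned(BURGERS_CLOUD, {"Burgers", "Gleichung", "Soliton", "KdV", "Hirota"}))
```

With these weights "Burgers" and "Gleichung" alone give 0.6, below the threshold 0.67, while three further supporting words raise the sum to 0.675, which is why at least three are needed.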

Of course if formula recognition, see Section 14 below, were available one 
would add to the list above 

<WORD VALUE="u_t - u_xx - u u_x = 0" WEIGHT="0.7"> 

(which is the Burgers equation in formula form). 

How to assign weights optimally is a large problem. Obviously this cannot 
be done by hand: a more or less adequate list of standard key phrases for 
mathematics needs at least 150 000 terms. I propose to use, among other 
things, something like the following adaptive procedure. 

Suppose one has an identification cloud of a term consisting of items 
$1, \ldots, n$ with weights $p_1, p_2, \ldots, p_n$ adding up to 1. Let a subset 
$S \subset \{1, 2, \ldots, n\}$ be successful in identifying the phrase involved. 
Then the new weights are: 



For $i \in S$: $p'_i = p_i + \frac{r}{|S|} \sum_{j \notin S} p_j$ 

For $i \notin S$: $p'_i = p_i - r p_i$ 

where $r$ is a fixed number to be chosen, $0 < r < 1$. (Note that the new 
weights again add up to 1; note also that the $i \in S$ increase in relative 
importance and the $i \notin S$ decrease in relative importance; if 
$S = \{1, \ldots, n\}$ nothing happens.) This is an adaptation of a reasonably 
well known algorithm for communication (telephone call) routing that works 
well in practice but is otherwise still fairly mysterious, [Azencott, 1986; 
Srikantakumar & Narendra, 1982]. 
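
A minimal sketch of one such adaptive step follows. The unsuccessful items each lose the fraction r of their weight, as stated above. How the freed weight is shared among the successful items is an assumption of this sketch (an equal split), chosen only because it keeps the total at 1; the function name `reinforce` is hypothetical.

```python
def reinforce(weights, successful, r):
    """One adaptive step: shift a fraction r of the unsuccessful weight to S.

    weights    -- dict mapping cloud item -> weight, summing to 1
    successful -- set S of items that helped identify the phrase
    r          -- fixed number with 0 < r < 1
    """
    if not 0.0 < r < 1.0:
        raise ValueError("r must satisfy 0 < r < 1")
    if not successful:
        return dict(weights)  # nothing to reward, leave the cloud unchanged
    moved = r * sum(p for i, p in weights.items() if i not in successful)
    share = moved / len(successful)  # ASSUMPTION: equal split among S
    return {
        i: (p + share if i in successful else p - r * p)
        for i, p in weights.items()
    }


old = {"a": 0.5, "b": 0.3, "c": 0.2}
new = reinforce(old, {"a"}, r=0.5)
print(new, sum(new.values()))
```

Any other split of the moved weight among the items of S (for instance proportional to their current weights) would also preserve the sum; which one works best is exactly the kind of question the text leaves open.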

8.11 Application 6. Synonyms 

There are a variety of things one can do with identification clouds to handle 
the well known problem of synonyms. 

Suppose there are two synonymous key phrases. Then providing both of 
them with the same identification clouds (including both phrases themselves 
also as items) will cause both of them to be assigned to those documents where 
that is appropriate. This would probably be the best way to handle this in most 
circumstances. 

Should, however, one prefer to have just one standardized key phrase, 
this can be handled by having the alternative key phrases in the identification 
cloud of the standardized one with a weight equal to or higher than the 
threshold value of the selected standardized key phrase; see Section 10 above 
for how these weights would work. 

8.12 Application 7. Crosslingual IR 

There are a variety of applications of the idea of identification clouds when 
dealing with multilingual situations in information retrieval and storage. Sup- 
pose for instance one has English language key phrases supplied with German 
language identification cloud items. One bit of use one can make of this is to 
attach English language key phrases to German language papers and chunks of 
text. 

Another one is as follows. Suppose we have a German speaking querier 
who is looking for English language documents as in dialogue mediated search 
(Section 6 above). Then the same German identification clouds attached to 
English key phrases permit the machine to handle a German language query. 

8.13 Application 8. Automatic Classification 

Here "automatic classification" means assigning to a document one or more 
classification numbers from the MSC2000 (Mathematics Subject Classifica- 
tion Scheme, [MSC2000, 1998]), or its precursor MSC1991. For instance 

14M06: linkage 

54B35: spectra 

55M10: dimension theory 

In this setting, instead of key phrases, it is the classification numbers from 
MSC2000 which are provided with identification clouds. This also gives these 
classification numbers substance and meaning. The terse descriptions like the 
three above are far from sufficient to indicate adequately what is meant (even 
to experts on occasion). 

Certainly the mere occurrence of the word "linkage" should not be 
considered sufficient to assign a paper or chunk of text the classification 
number 14M06. First of all one would like to be sure that the document in 
question is about algebraic geometry; this can be done by referring to the 
identification cloud of the parent node 14 (Algebraic geometry). Second, one 
would like additional evidence like the presence of such supporting phrases as 
"complete intersection", "determinantal variety", "determinantal ideal", .... 
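
The two-step check just described can be sketched as follows. This is only an illustration: the parent-node terms are invented for the example, the supporting phrases are the ones quoted above, and the names `cloud_hits` and `assign_code` are hypothetical.

```python
# Hypothetical cloud items for the parent node 14 (Algebraic geometry).
PARENT_14 = ("variety", "scheme", "sheaf")
# Supporting phrases for 14M06 quoted in the text.
SUPPORT_14M06 = ("complete intersection", "determinantal variety", "determinantal ideal")


def cloud_hits(cloud_terms, text):
    """Return the cloud terms that occur in the text (case-insensitive)."""
    lowered = text.lower()
    return [term for term in cloud_terms if term in lowered]


def assign_code(text, parent_terms, support_terms, min_parent=1, min_support=1):
    """Assign the classification only if parent and own evidence are both present."""
    parent_ok = len(cloud_hits(parent_terms, text)) >= min_parent
    support_ok = len(cloud_hits(support_terms, text)) >= min_support
    return parent_ok and support_ok


print(assign_code("We study the linkage class of a complete intersection "
                  "in a projective variety.", PARENT_14, SUPPORT_14M06))
print(assign_code("The linkage between inflation and unemployment is weak.",
                  PARENT_14, SUPPORT_14M06))
```

Note that the word "linkage" itself plays no role in the decision here, so a text on the subject can be classified even if it never uses that word.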

Conversely, a paper may very well be about the rather technical group of ideas 
"linkage" without ever mentioning that particular word. 

The other two examples just given also need more complete descriptions 
as to what is really meant (disambiguation and more). For instance there are 
notions of spectrum in many different parts of mathematics: combinatorics, 
number theory (two different ones at least), homological algebra, ordinary and 
partial differential equations, dynamical system theory, harmonic analysis, op- 
erator theory, general topology, algebraic topology, global analysis, statistics, 
mechanics, quantum theory, .... Most are somehow related to the original 
idea of the spectrum of a substance as in physics/chemistry; but some others 
are completely different. 

The exact phrase "dimension theory" occurs four times in MSC2000 while 
the stem "dimension" occurs no less than 94 times. 

8.14 Application 9. Formula Recognition 

Recognizing (or finding) formulas in scientific texts is (in any case at first 
sight) a completely different matter from recognizing or finding key phrases. 
First because formulas are two dimensional and second because the symbols 
occurring in formulas are not standardized (except a few like the integral sign 
and the summation sign). Even a standard symbol like π for the number 
3.1415... that gives the ratio of the circumference of a circle to its diameter 
is not a reliable guide. The Greek letter π is also often used for, for 
instance, all kinds of mappings in various kinds of geometry, for partitions in 
combinatorics, and for permutations in group theory. 
For instance the two expressions 

[Display: two integrals of identical form, written with different symbols for 
the integration variable.] 

mean exactly the same thing. It is the pattern rather than the actual glyphs that 
occur which determine what a formula means. 

And even the patterns are not all that fixed. For instance here are a few 
versions of that very well known concept in mathematics and engineering, the 
(one dimensional) Fourier transform (there are quite a few more): 

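$\hat f(\xi) = \int f(x) e^{-i\xi x}\,dx$, see [Katznelson, 1968], p. 120 

$\int_{-\infty}^{\infty} dq\, f(q) \exp(-ipq)$, see [Wolf, 1979], p. 134 

$g(u) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(x) e^{-iux}\,dx$, see [Wiener, 1933], p. 34 

$\int_{-\infty}^{+\infty} e^{-2i\pi\lambda x} f(x)\,dx$, see [Schwartz, 1966], p. 176 

$F(\omega) = \frac{1}{2\pi} \int_{-\infty}^{\infty} f(t) e^{-i\omega t}\,dt$, see [Levich, 1970], p. 376 

$\int_{-\infty}^{\infty} f(t) e^{-j\omega t}\,dt$, see [Hsu, 1967], p. 103 

$\int_{-\infty}^{\infty} f(x) \exp(-2\pi i x y)\,dx$, see [Bakonyi, 1992], p. 45 

$\hat f(\chi) = \int_G \bar\chi f\,d\lambda$, see [Hewitt & Ross, 1963], p. 359 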

Most of the variations come from different notations for the exponential, the 
insertion or deletion of normalizing factors involving π, the engineering 
tradition of writing $\sqrt{-1}$ as j instead of i (as in most of mathematics 
and physics), different notations for integrands, and putting in or leaving out 
the integration limits. 

Still, it is not easy to define formally what kind of transformations are 
allowed. On the other hand, trained mathematicians have no difficulty in 
recognizing any of the above (except possibly the last) as instances of a Fourier 
transform. Quite generally trained mathematicians can look at a text in their 
fields of expertise in a language totally unknown to them and still decide what 
topics the text deals with and at what level things are treated just by looking 
at the formulas. Whether that sort of expertise can be taught to machines is an 
open question. The field of formula recognition is still in its infancy; I would 
say it is still in a foetal stage. 

Identification clouds can help. The idea is the same as before. But instead 
of a key phrase it is now a (standardized) formula which has an identification 
cloud attached to it. In the present case one can imagine that the (obligatory) 
presence of an integral sign, the (also obligatory) presence of the function 
symbols 'exp(.)' or 'e^(.)' and an integration variable 'd' in the formula, plus 
supporting evidence in the form of the occurrence of (some of the) words like 
"transform", "Fourier", "spectral analysis", "harmonic", ... in the surrounding 
text would not do a bad job of identifying Fourier transform formulas. 
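
The following sketch shows what such a check might look like. It assumes, purely for illustration, that the formula is available as a LaTeX source string (the text does not say in what form formulas would be presented to the machine); the function names and the regular expressions are hypothetical, and the supporting words are the ones listed above.

```python
import re

SUPPORT_WORDS = ("transform", "fourier", "spectral analysis", "harmonic")


def fourier_evidence(formula, context):
    """Return (obligatory features all present?, supporting words found)."""
    has_integral = "\\int" in formula
    has_exp = "\\exp" in formula or re.search(r"e\s*\^", formula) is not None
    # A differential: 'd' followed by a variable, not part of a longer word
    # or LaTeX command.
    has_differential = (
        re.search(r"(?<![A-Za-z\\])d\s*\\?[A-Za-z]", formula) is not None
    )
    text = context.lower()
    support = [w for w in SUPPORT_WORDS if w in text]
    return has_integral and has_exp and has_differential, support


def is_fourier(formula, context, min_support=1):
    """Accept only if the obligatory features and enough context are present."""
    obligatory, support = fourier_evidence(formula, context)
    return obligatory and len(support) >= min_support


print(is_fourier(r"\hat f(\xi) = \int f(x) e^{-i \xi x} \, dx",
                 "The Fourier transform of f is defined by"))
print(is_fourier(r"\int_0^1 e^{-x} \, dx", "the area under a curve"))
```

The second call shows why the surrounding text matters: the formula has all the obligatory glyphs, but nothing in its context supports reading it as a Fourier transform.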

Some preliminary work on formula recognition using identification clouds 
is planned in the EC project [Choi, 1986]. 

8.15 Context Sensitive IR 

In a very real sense the idea of identification clouds is that of context 
sensitive approximate string recognition. Even if the string itself, that is the 
key phrase in question, is not recognized, the context may provide sufficient 
supporting evidence to conclude that the string should be there as a key phrase. 
But the way the context is used is very unsophisticated. There is no 
(complicated) grammatical analysis or anything like that. I believe that this is 
how trained scientists function. They just look casually at the surrounding text 
of, say, a formula, and on the basis of what they see there decide what it is all 
about. I do not believe they really do any kind of grammatical analysis or 
transformations. Indeed, many of us are incapable of doing anything like that, 
for very often we have to work in foreign languages which are far from 
perfectly known to us. 

8.16 Models for ID Clouds 

So far there has been no worry about just how the supporting evidence 
coming from identification clouds is distributed. This does not matter too much 
if one is dealing with the problem of assigning key phrases to short chunks of 
text or to abstracts, say, to documents of the size of at most one computer 
screen or one A4 page. 

Things change drastically if one has to deal with longer chunks of text and 
especially if one has to assign key phrases, classifications, and other metadata 
to complete, full text documents. Obviously it makes an enormous difference 
whether the items of an identification cloud for some key phrase or 
classification or formula or ... are spread around very far, are very diffuse, 
or are concentrated in just a few lines of text. 

Thus what is needed for many applications touched upon in this paper is an 
experimentally justified stochastic model on how the items of an identification 
cloud are distributed. And for that matter, how key phrases, whether actually 
present or not, are distributed over a document. This is of particular importance 
for the application "slicing of documents" discussed in Section 9 above. 

8.17 Automatic Generation of Identification Clouds 

Take a large enough, well indexed corpus, and divide it into suitable chunks 
called documents. For instance take the 700 000 abstracts of articles in the 
STN/FIZ database Math (ZMG data) 1 , or take as documents the sections or 
pages of a large handbook or encyclopaedia such as the Handbook of Theoret- 
ical Computer Science, [van Leeuwen, 1990] or the Encyclopaedia of Math- 
ematics, [Landau, 1987], or an index like [Schur, 1986; Hazewinkel, 2001]. 
Now use a parser for prepositional noun phrases (PNP's) (or an automaton rec- 
ognizing PNP's) or a software indexing program like TExTract or CLARIT, 
[Arocena, 1990A; Arov, 1983; Dym, 1988; Foias, 1990; Gabardo, 1993], to 
generate from these documents a list of key phrases, keeping track of what 
phrases come from what document. Now assign, as ID clouds, to the items 
of the list of keyphrases, those words and phrases found by, say, the software 
indexing program, which occur in the same document as the key phrase under 
consideration. 
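
A minimal sketch of this co-occurrence procedure is given below. It assumes that the indexing program has already produced, for each document, the set of phrases found in it; turning co-occurrence counts into weights that add up to 1 is an addition of this sketch, not something the text prescribes, and the name `build_clouds` is hypothetical.

```python
from collections import Counter, defaultdict


def build_clouds(doc_phrases, max_items=10):
    """Build an ID cloud for every phrase from document co-occurrence.

    doc_phrases -- dict mapping document id -> set of phrases extracted from it
    Returns a dict mapping key phrase -> {cloud item: weight}.
    """
    cooccurrence = defaultdict(Counter)
    for phrases in doc_phrases.values():
        for key in phrases:
            for other in phrases:
                if other != key:
                    cooccurrence[key][other] += 1
    clouds = {}
    for key, counter in cooccurrence.items():
        top = counter.most_common(max_items)
        total = sum(count for _, count in top)
        # ASSUMPTION: weight proportional to the number of shared documents.
        clouds[key] = {term: count / total for term, count in top}
    return clouds


docs = {
    1: {"Burgers equation", "soliton", "KdV equation"},
    2: {"Burgers equation", "soliton"},
    3: {"Burgers equation", "turbulence"},
}
print(build_clouds(docs)["Burgers equation"])
```

In this toy corpus "soliton" shares two documents with "Burgers equation" and therefore receives half of the total weight.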

8.18 Multiple Identification Clouds 

Picture the set of all documents (chunks of text) in mathematics as a space. 
For instance a discrete metric space as in [Hazewinkel, 1999b]. There may then 
very well be several distinct regions in this space where a given key phrase, 
like "Burgers equation" occurs with some frequency. In this case one may 
well need several different identification clouds for the same key phrase, even 
though there is no ambiguity involved. This happens in fact in the case at hand. 
The Burgers equation has relations with the field of completely integrable 
systems: it itself has soliton solutions and it is also related to what is 
probably the most famous soliton equation, the KdV equation (Korteweg-de Vries 
equation). The identification cloud above in Section 10 was designed to catch 
this type of occurrence of the concept. On the other hand it is the simplest 
nonlinear diffusion equation and plays a role as such and in discussions of 
turbulence. To catch those occurrences a rather different set of supporting 
words and phrases is needed (like diffusion, turbulence, eddy, nonlinearity, 
...). Just combining the two identification clouds is dangerous because then, by 
accident, the various different collections of supporting evidence phrases 
together may combine to give a spurious assignment. One can also not concentrate 
too much on the proper name "Burgers" for the reasons mentioned in Section 10 
above. 

1 Though this one (the STN/FIZ database Math) is not really well indexed in the 
sense that the key phrases assigned are not from a controlled list. However, if 
the intention would be to generate the controlled list at the same time as the 
corresponding ID clouds, this material would be most suitable. 

8.19 More about Weights. Negative Weights 

Another refinement that came out of the experiences with the TRIAL 
SOLUTION project is that it could be a very good idea to allow negative 
weights. Let's look at an example. 

"The next topic to be discussed is that of the Fibonacci numbers. The generating 
formula is very simple. But all in all these numbers and their surprisingly many 
applications are sufficiently complex to make the topic very interesting. Similar 
things happen in the study of fractals." 

Or even worse: 

"These mixed spectrum solutions must be numbered among the more complex 
ones of the KdV equation. Still they cannot be neglected." 

Both 'complex' and 'numbers' occur in the first fragment of text above 
(italicized). But, obviously it would be totally inappropriate to assign the 
technical keyphrase 'complex numbers' to this fragment. A negative weight on 
'Fibonacci' in the ID cloud of 'complex numbers' will prevent that. 

For the second text fragment the technique of stemming, which needs to be 
used, will give "number", and "complex" also occurs. But here also it would 
be totally inappropriate to assign the key phrase "complex numbers". It is not 
so easy to see how to avoid this. 

There are still other possible sources of difficulties because "complex" is 
also a technical term in algebraic topology and homological algebra so one 
can have a fragment like 

"The Betti numbers of this cell complex are..." 

or still worse: 

"The idea of a simplicial complex numbers among the most versatile notions 
that..." 

Here even the exact phrase "complex numbers" occurs and negative weights 
are a must to avoid a spurious assignment. 

Quite generally it seems fairly clear that the presence of the constituents of a 
standard key phrase in a given chunk of text is by no means sufficient to be sure 
that key phrase is indeed appropriate. This is especially the case for concepts 
that are made up out of frequently occurring words like "complex numbers" 
or "boundary value formula". But we have also seen this in the case of the 
"Burgers equation" above in Section 10. For the case of the phrase "complex 
numbers" one needs an identification cloud like 

<KEYPHRASE NAME="complex numbers" THRESHOLD="0.4"> 
<WORD VALUE="complex numbers" WEIGHT="0.5"> 
<WORD VALUE="complex" WEIGHT="0.2"> 
<WORD VALUE="numbers" WEIGHT="0.2"> 
<WORD VALUE="field" WEIGHT="0.06"> 
<WORD VALUE="imaginary part" WEIGHT="0.06"> 
<WORD VALUE="real part" WEIGHT="0.06"> 
<WORD VALUE="absolute value" WEIGHT="0.06"> 
<WORD VALUE="Gauss" WEIGHT="0.06"> 
<WORD VALUE="argument" WEIGHT="0.06"> 
<WORD VALUE="principal value" WEIGHT="0.06"> 
<WORD VALUE="vector representation" WEIGHT="0.06"> 
<WORD VALUE="addition" WEIGHT="0.06"> 
<WORD VALUE="multiplication" WEIGHT="0.06"> 
<WORD VALUE="Fibonacci" WEIGHT="-0.5"> 
<WORD VALUE="Betti" WEIGHT="-0.5"> 
</KEYPHRASE> 

So that besides "complex" and "number" one needs at least 2 more bits of 
supporting evidence to have a reasonable chance that the fragment in question 
indeed has to do with the field of complex numbers. On the other hand if 
at least 8 of the last ten positive weight terms of the identification cloud above 
are present one is also rather sure that the fragment in question has to do with 
the field of complex numbers. The tentative identification cloud given above 
reflects this. But it is clear that assigning weights properly is a delicate matter; 
it is also clear that much can be done with weights. 
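
How a negative weight blocks a spurious assignment can be seen in a small sketch that scores raw text. The cloud is an abridged copy of the tentative one above; the whole-word matching by regular expression and the name `text_score` are assumptions of this sketch, and no stemming is attempted.

```python
import re

COMPLEX_NUMBERS = {
    "threshold": 0.4,
    "weights": {
        "complex numbers": 0.5,
        "complex": 0.2,
        "numbers": 0.2,
        "imaginary part": 0.06,
        "real part": 0.06,
        "Fibonacci": -0.5,
        "Betti": -0.5,
    },
}


def text_score(cloud, text):
    """Add up the weights (positive or negative) of the items found in the text."""
    lowered = text.lower()
    score = 0.0
    for term, weight in cloud["weights"].items():
        # Match the term as whole words only, ignoring case.
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered):
            score += weight
    return score


fragment = ("The next topic to be discussed is that of the Fibonacci numbers. "
            "The generating formula is very simple. But all in all these numbers "
            "and their surprisingly many applications are sufficiently complex "
            "to make the topic very interesting.")
print(text_score(COMPLEX_NUMBERS, fragment) >= COMPLEX_NUMBERS["threshold"])
```

For the Fibonacci fragment the positive evidence from "complex" and "numbers" is more than cancelled by the negative weight, so the key phrase is not assigned.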

Thus also in the case of occurrences of the same concept in the same part of 
mathematics, more than one identification cloud may be a good idea, reflecting 
different styles of presentation and different terminological traditions. 

The concrete examples of Section 2 above also illustrate the possible value 
of negative information. 

8.20 Further Refinements and Issues 

There are a good many other issues to be addressed. Here is one. It is 
more or less obvious that making one keyphrase list with ID clouds for all of 
science and technology is a hopeless task. What one aims at is instead an Atlas 
of Science and Technology consisting of many weak thesauri that partially 
overlap, may have different levels of detail, and may focus on different kinds 
of interest. Much like a geographical atlas which has charts of many different 
levels of detail and many different kinds (mineralogical, roads and train lines, 
soil types, height, type of terrain, demographical, climatological, ...). Here 
the problem arises of how to match the different 'charts'. 

Another one is how to adapt the adaptive scheme of Section 10 to a situation 
with negative weights and how to handle insertion and deletion of ID cloud 
members. 

In an enriched weak thesaurus a key phrase has not only words in its 
identification cloud but also one or more classification numbers from MSC2000. 
In turn these classification numbers have identification clouds. The idea is that 
once a candidate key phrase has been found these are used to check that indeed 
the paper is related to the topics described by those classification numbers. 
This idea of referring to other (secondary) identification clouds can be used in 
all of the various applications described above. For instance it is needed if one 
uses a formula to identify a key phrase as suggested at the end of Section 10. 
Such referring to other identification clouds was also briefly mentioned in 
Section 13 above. Just how this should be implemented still needs to be worked 
out. 

Probably the most crucial issue to be addressed at this stage is the 
formulation of a good probabilistic model of ID clouds complete with statistical 
estimators, see Section 16. A project in this direction has been started by the 
CWI, Amsterdam together with the IMI, Lithuanian Acad. of Sciences, Vilnius. 

Abstract We consider continuous-time and discrete-time jump parameter linear control 
systems with semi-Markov coefficients and solution jumps that coincide with 
jumps of a semi-Markov random process. First, we derive stability conditions 
for semi-Markov systems of differential equations. We then determine necessary 
optimality conditions for the solutions of continuous-time and discrete-time 
control systems. 

9.1 Introduction 

Jump parameter linear control systems with Markov coefficients have been 
examined in many recent publications. To date, systems of equations defining 
optimal control have been derived, and recent research in this field now focuses 
on developing effective numerical methods for solving these systems ([Arov, 
1983]-[Castro, 1986]). 

In this article, we consider continuous-time and discrete-time jump parameter 
linear systems with semi-Markov coefficients. These systems represent a 
generalization of those systems described above, since a semi-Markov process 
that satisfies certain conditions is Markov. The well-known systems of 
equations which define optimal control for Markov jump parameter systems can be 
obtained utilizing the systems of equations for semi-Markov control systems 
that we will derive in this paper, and they can be viewed as a particular case of 
these systems. 

The problem of obtaining optimal control for semi-Markov control systems 
is closely related to the problem of finding necessary and sufficient 
$L_2$-stability conditions for semi-Markov systems of differential equations. We 
therefore also consider this problem in this article. 

In order to formulate the problems that we are going to study, we need to 
introduce some notation and review some well-known facts concerning finite- 
valued semi-Markov processes. 

Consider a finite-valued semi-Markov process $\xi(t)$ with $n$ possible states 
$\theta_1, \ldots, \theta_n$ which jumps from some state $\theta_k$ to some state 
$\theta_s$ at consecutive times $t_j$, $j = 1, 2, \ldots$, $t_0 = 0$. The random 
chain $\xi(t_j)$ is a Markov chain, the transition-probabilities matrix of which 

$$\Pi = (\pi_{sk})_{s,k=1}^{n}, \qquad \pi_{sk} = P\{\xi(t_{j+1}) = \theta_s \mid \xi(t_j) = \theta_k\} \quad (k, s = 1, \ldots, n)$$ 

is given. (Note the order of the indices $s$, $k$ here.) 

Jump times $t_j$ for the semi-Markov process $\xi(t)$ are defined by distribution 
functions $F_{sk}(t) = P\{T_{sk} < t\}$ of random variables $T_{sk}$ 
$(s, k = 1, \ldots, n)$, where $T_{sk}$ is the duration of time in which the 
process belongs to state $\theta_k$ before it jumps to state $\theta_s$, provided 
that such a jump takes place. 

The behavior of the process $\xi(t)$ after any time $t_j$ is completely defined by 
$\Pi$ and the probability-functions matrix 

$$F(t) = (F_{sk}(t))_{s,k=1}^{n}$$ 

or the corresponding probability-density-functions matrix 

$$f(t) = (f_{sk}(t))_{s,k=1}^{n}.$$ 

The intensities $q_{sk}(t)$ are then defined by the formulas 

$$q_{sk}(t) = \pi_{sk} f_{sk}(t) \quad (k, s = 1, \ldots, n), \tag{1.1}$$ 

and we define 

$$q_k(t) = \sum_{s=1}^{n} q_{sk}(t) \quad (k = 1, \ldots, n).$$ 

Finally, we let $T_k$ denote the duration of time between two consecutive jump 
times $t_j$ and $t_{j+1}$, provided that at time $t_j$ the process jumps to 
$\theta_k$. 

Obviously, $q_k(t)$ is the probability density of $T_k$. Let $F_k(t)$ denote the 
probability distribution function of $T_k$, and let $\psi_k(t)$ denote the 
probability of the event that no jumps take place during the time interval 
$(t_j, t_j + t)$ provided that at time $t_j$ the process jumps to $\theta_k$. 
Clearly, 

$$\psi_k(t) = \int_t^{\infty} q_k(\tau)\,d\tau. \tag{1.2}$$ 
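
As a quick check of these definitions, consider a special case that is assumed here only for illustration and is not taken from the text: the sojourn time in state $\theta_k$ is exponential with rate $\lambda_k > 0$, whatever the next state is. The quantities introduced above then reduce to the familiar ones for a Markov jump process.

```latex
% Illustrative special case (an assumption, not from the text):
% exponential sojourn times with rate \lambda_k, independent of the next state s.
\begin{align*}
f_{sk}(t) &= \lambda_k e^{-\lambda_k t}, \\
q_{sk}(t) &= \pi_{sk} f_{sk}(t) = \pi_{sk} \lambda_k e^{-\lambda_k t}, \\
q_k(t)    &= \sum_{s=1}^{n} q_{sk}(t) = \lambda_k e^{-\lambda_k t}
             \quad \text{since } \sum_{s=1}^{n} \pi_{sk} = 1, \\
\psi_k(t) &= \int_t^{\infty} q_k(\tau)\, d\tau = e^{-\lambda_k t}.
\end{align*}
```

Because the exponential distribution is memoryless, the process is Markov in this case, which is one instance of the remark in the introduction that a semi-Markov process satisfying certain conditions is Markov.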

Now, let $n$ different functions $u_k(t)$ $(k = 1, \ldots, n)$ be defined for 
$t \ge 0$. We will call a random process $u(t, \xi(t))$ a semi-Markov function if 
for $t_j \le t < t_{j+1}$, $\xi(t) = \theta_k$, we have 

$$u(t, \xi(t)) = u_k(t - t_j); \tag{1.3}$$ 

i.e., between two jumps of the random process $\xi(t)$, when $\xi(t) = \theta_k$, 
the semi-Markov function coincides with the deterministic function 
$u_k(t - t_j)$. In the special case when $u_k(t) = \theta_k$ 
$(k = 1, \ldots, n)$, the semi-Markov function coincides with the semi-Markov 
finite-valued process $\xi(t)$. 

Let $\langle u(t, \xi(t)) \rangle$ denote the mathematical expectation of a 
semi-Markov function, and denote the conditional mathematical expectations by 

$$v_k(t) = \langle u(t, \xi(t)) \mid \xi(0) = \theta_k \rangle \quad (k = 1, \ldots, n). \tag{1.4}$$ 

We thus have the system of integral equations 

$$v_k(t) = \psi_k(t)\, u_k(t) + \int_0^t \sum_{s=1}^{n} v_s(t - \tau)\, q_{sk}(\tau)\,d\tau \quad (k = 1, \ldots, n). \tag{1.5}$$ 

Let $A_k(t)$ $(k = 1, 2, \ldots, n)$ be some given deterministic matrix functions, 
and let $A(t, \xi(t))$ denote a semi-Markov matrix function that takes values 
$A_k(t - t_j)$ for $t_j \le t < t_{j+1}$, provided that $\xi(t)$ belongs to state 
$\theta_k$ during the time period $[t_j, t_{j+1})$. 

We consider the system of differential equations 

$$\frac{dX(t)}{dt} = A(t, \xi(t))\, X(t). \tag{1.6}$$ 

Assume that the solutions of the system have jumps which take place 
simultaneously with the jumps of $\xi(t)$. These jumps are defined by the formulas 

$$X(t_j + 0) = \Gamma_{sk} X(t_j - 0), \qquad \det \Gamma_{sk} \ne 0, \tag{1.7}$$ 

where $\Gamma_{sk}$ $(s, k = 1, \ldots, n)$ are some given matrices. 

Next, we introduce the notion of $L_2$-stability for semi-Markov systems given 
by (1.6)-(1.7). First, let $\langle X \rangle$ denote the mathematical expectation 
$E(X)$. Then, the system (1.6)-(1.7) is called $L_2$-stable if, for arbitrary 
$X(0)$, we have 

$$\int_0^{\infty} D(t)\,dt < \infty, \tag{1.8}$$ 

where $D(t) = \langle X(t) X^T(t) \rangle$ and $X(t)$ is the solution of (1.6). 

Linear continuous-time semi-Markov control systems are introduced in a 
similar way in Section 3. 

In Section 4, we consider discrete-time semi-Markov control systems, the 
coefficients of which depend on a discrete semi-Markov process $\xi_k$, the jumps 
of which can take place at times $t = 0, 1, 2, \ldots$. 

The notations $q_s(k)$ $(s = 1, \ldots, n;\ k = 1, 2, \ldots)$ and $\psi_s(k)$ 
will be analogous to $q_k(t)$ $(k = 1, \ldots, n)$ and $\psi_k(t)$ given earlier. 
Obviously, the following equalities hold: 

$$\psi_s(k) = \sum_{j=k+1}^{\infty} q_s(j), \qquad q_s(k) = \sum_{l=1}^{n} q_{ls}(k),$$ 

where the intensities $q_{ls}(k)$ are analogous to the intensities $q_{sk}(t)$ 
introduced earlier. 

9.2 Stability conditions for semi-Markov systems 

We introduce the quadratic form 

$$w(X) = X^T B X, \qquad B = B^T > 0 \tag{2.1}$$ 

and define the Lyapunov function by the formula 

$$V = \int_0^{\infty} \langle w(X(t)) \rangle\,dt, \tag{2.2}$$ 

where $X(t)$ is the random solution of system (1.6) with solution jumps (1.7). In 
order to find $V$, we introduce conditional stochastic Lyapunov functions 

$$V_k(X) = X^T C_k X = \int_0^{\infty} \langle w(X(t)) \mid X(0) = X,\ \xi(0) = \theta_k \rangle\,dt \quad (k = 1, \ldots, n). \tag{2.3}$$ 

If the functions $V_k(X)$ $(k = 1, \ldots, n)$ are known, then the function $V$ 
in (2.2) can be found from the formula 

$$V = \sum_{k=1}^{n} \int_{E_m} V_k(X) f_k(0, X)\,dX = \sum_{k=1}^{n} \int_{E_m} C_k \circ X X^T f_k(0, X)\,dX = \sum_{k=1}^{n} C_k \circ D_k(0), \tag{2.4}$$ 

where the symbol "$\circ$" denotes the scalar product of matrices, and the 
functions $f_k(0, X)$ $(k = 1, \ldots, n)$ are defined by the formula 

$$P\{\xi(0) = \theta_k,\ X \in A\} = \int_A f_k(0, X)\,dX,$$ 

where $A$ is an arbitrary domain in the Euclidean space $E_m$. 

In order to form a system of equations which defines the functions $V_k(X)$ 
$(k = 1, \ldots, n)$, we introduce auxiliary quadratic forms 

$$u_k(t, X) = X^T U_k(t) X = \langle w(X(t)) \mid X(0) = X,\ \xi(0) = \theta_k \rangle. \tag{2.5}$$ 

We denote by $N_k(t)$ the fundamental-solutions matrices for the systems of 
linear differential equations 

$$\frac{dX_k(t)}{dt} = A_k(t) X_k(t) \quad (k = 1, \ldots, n). \tag{2.6}$$ 

The solutions of system (2.6) can then be expressed in the form 

$$X_k(t) = N_k(t) X_k(0).$$ 

Utilizing formulas (1.5), we derive the system of equations 

$$u_k(t, X) = \psi_k(t)\, w(N_k(t) X) + \int_0^t \sum_{s=1}^{n} u_s(t - \tau,\ \Gamma_{sk} N_k(\tau) X)\, q_{sk}(\tau)\,d\tau \quad (k = 1, \ldots, n). \tag{2.7}$$ 

This system can be rewritten as 

$$X^T U_k(t) X = \psi_k(t)\, X^T N_k^T(t) B N_k(t) X + \int_0^t \sum_{s=1}^{n} X^T N_k^T(\tau) \Gamma_{sk}^T U_s(t - \tau) \Gamma_{sk} N_k(\tau) X\, q_{sk}(\tau)\,d\tau \quad (k = 1, \ldots, n) \tag{2.8}$$ 

and as 

$$U_k(t) = \psi_k(t)\, N_k^T(t) B N_k(t) + \int_0^t \sum_{s=1}^{n} N_k^T(\tau) \Gamma_{sk}^T U_s(t - \tau) \Gamma_{sk} N_k(\tau)\, q_{sk}(\tau)\,d\tau. \tag{2.9}$$ 

Integrating the system of equations, we find the following equations for the 
matrices $C_k$: 

$$C_k = \int_0^{\infty} U_k(t)\,dt = \int_0^{\infty} \psi_k(t)\, N_k^T(t) B N_k(t)\,dt + \int_0^{\infty} \sum_{s=1}^{n} N_k^T(t) \Gamma_{sk}^T C_s \Gamma_{sk} N_k(t)\, q_{sk}(t)\,dt \quad (k = 1, \ldots, n). \tag{2.10}$$ 

The monotonicity of the operators $Q^*_{sk}$ defined by the formula 

$$Q^*_{sk} C = \int_0^{\infty} N_k^T(t) \Gamma_{sk}^T C\, \Gamma_{sk} N_k(t)\, q_{sk}(t)\,dt \quad (s, k = 1, \ldots, n) \tag{2.11}$$ 

enables us to formulate a theorem on the $L_2$-stability of the system (1.6). 

First, we formulate (without proof) the following lemma, which asserts that 
here, all norms are equivalent: 

LEMMA 1 The integral 

$$\int_0^{\infty} \psi_k(t)\, N_k^T(t) B N_k(t)\,dt$$ 

converges iff the integral 

$$J_k = \int_0^{\infty} \psi_k(t)\, \|N_k(t)\|^2\,dt$$ 

converges, where $\|N_k(t)\|$ designates the Euclidean norm of $N_k(t)$. 

We then have the following theorem: 

THEOREM 1 Assume that for the system of linear differential equations (1.6) 
with random semi-Markov coefficients and solution jumps (1.7), the necessary 
stability conditions $J_k < \infty$ $(k = 1, \ldots, n)$ are satisfied. Then the 
zero solution of the system is $L_2$-stable iff for some positive definite 
matrices $B_k > 0$, the system of equations 

$$C_k = B_k + \sum_{s=1}^{n} Q^*_{sk} C_s \quad (k = 1, \ldots, n) \tag{2.12}$$ 

has a positive-definite solution $C_k > 0$ $(k = 1, \ldots, n)$. 

Proof. The proof follows from the fact that the existence of a positive-definite 
solution of (2.12) is equivalent to the convergence of the successive 
approximations 

$$C_k^{(j+1)} = B_k + \sum_{s=1}^{n} Q^*_{sk} C_s^{(j)}, \qquad C_k^{(0)} = 0 \quad (k = 1, \ldots, n;\ j = 0, 1, 2, \ldots).$$ 

It can easily be shown that if for some matrices $B_k > 0$ $(k = 1, \ldots, n)$ the 
successive approximations converge, then they converge also for any arbitrarily 
chosen positive-definite matrices $B_k$ $(k = 1, \ldots, n)$. ■ 
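
The successive approximations of the proof are easy to try out numerically in the simplest setting. The sketch below makes several assumptions that are not in the text: the state is scalar (m = 1), each A_k is a constant a_k so that N_k(t) = exp(a_k t), and the sojourn time in state k is exponential with rate lam[k], so that q_sk(t) = pi[s][k] * lam[k] * exp(-lam[k] t). The operator Q*_sk then reduces to multiplication by a number, computed in closed form by the hypothetical helper `exponential_gains`.

```python
def exponential_gains(a, lam, pi, gamma):
    """Scalar value g[s][k] of the operator Q*_sk for exponential sojourn times.

    g[s][k] = integral of exp(2 a_k t) * gamma_sk^2 * pi_sk * lam_k * exp(-lam_k t) dt
            = gamma_sk^2 * pi_sk * lam_k / (lam_k - 2 a_k),  finite iff lam_k > 2 a_k.
    pi[s][k] is the probability of jumping to state s from state k.
    """
    n = len(a)
    g = [[0.0] * n for _ in range(n)]
    for k in range(n):
        if lam[k] <= 2 * a[k]:
            raise ValueError("J_k diverges: need lam[k] > 2 * a[k]")
        for s in range(n):
            g[s][k] = gamma[s][k] ** 2 * pi[s][k] * lam[k] / (lam[k] - 2 * a[k])
    return g


def successive_approximations(B, g, tol=1e-12, max_iter=10000):
    """Iterate C_k <- B_k + sum_s g[s][k] * C_s from C = 0.

    Returns the limit, or None if the iteration does not converge.
    """
    n = len(B)
    C = [0.0] * n
    for _ in range(max_iter):
        new = [B[k] + sum(g[s][k] * C[s] for s in range(n)) for k in range(n)]
        if max(new) > 1e12:
            return None  # iterates blow up: no positive solution
        if max(abs(x - y) for x, y in zip(new, C)) < tol:
            return new
        C = new
    return None


pi = [[0.0, 1.0], [1.0, 0.0]]       # the two states alternate
gamma = [[1.0, 1.0], [1.0, 1.0]]    # no change of X at the jumps
g = exponential_gains(a=[-1.0, 0.5], lam=[1.0, 2.0], pi=pi, gamma=gamma)
print(successive_approximations([1.0, 1.0], g))
```

In this example the second state is unstable on its own (a = 0.5), yet the iteration converges (to C = 4 and C = 9 up to rounding), so the switched system is stable in the sense of the theorem. Raising that coefficient to 0.9 makes the iterates grow without bound and the function returns None.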



9.3 Optimization of continuous control systems with 
semi-Markov coefficients 

In this section, we find necessary optimality conditions for solutions of lin- 
ear continuous control systems with semi-Markov coefficients and solution 
jumps coinciding with jumps of a semi-Markov random process. Values of 
a quadratic functional are obtained with the help of equations for Lyapunov 
functions and minimized by choosing control coefficients. The necessary opti- 
mality conditions can be utilized in determining the optimal control. 

We consider the linear control system

$$\frac{dX(t)}{dt} = A(t, \xi(t)) X(t) + B(t, \xi(t)) U(t) \tag{3.1}$$

with random semi-Markov coefficients. We seek a control vector $U(t)$ which
minimizes the quadratic functional

$$V = \left\langle \int_0^\infty \left( X^T(t) Q(t, \xi(t)) X(t) + U^T(t) L(t, \xi(t)) U(t) \right) dt \right\rangle, \tag{3.2}$$

where $Q(t, \xi(t))$ and $L(t, \xi(t))$ are symmetric positive definite matrices.
Suppose that a semi-Markov process $\xi(t)$ has jumps at times $t_j$ $(j = 0, 1, 2, \dots)$,
where $t_0 = 0 < t_1 < t_2 < \dots$. Assume that at $t_j \le t < t_{j+1}$, $\xi(t) = \theta_s$, the
following equalities hold:

$$A(t, \xi(t)) = A_s(t - t_j), \quad B(t, \xi(t)) = B_s(t - t_j),$$

$$Q(t, \xi(t)) = Q_s(t - t_j), \quad L(t, \xi(t)) = L_s(t - t_j) \quad (s = 1, \dots, n),$$

where $A_s(t), B_s(t), Q_s(t), L_s(t)$ are deterministic matrices. Assume that the
optimal control has the form

$$U(t) = S(t, \xi(t)) X(t), \tag{3.3}$$

where $S(t, \xi(t))$ is a matrix with semi-Markov coefficients which, at
$t \in [t_j, t_{j+1})$, $\xi(t) = \theta_s$, takes values

$$S(t, \xi(t)) = S_s(t - t_j) \quad (s = 1, \dots, n).$$

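We introduce the following notation:

$$H(t, \xi(t)) = Q(t, \xi(t)) + S^T(t, \xi(t)) L(t, \xi(t)) S(t, \xi(t)). \tag{3.4}$$

Substituting the control (3.3) into the control system, we then obtain the
system of linear differential equations with semi-Markov coefficients

$$\frac{dX(t)}{dt} = G(t, \xi(t)) X(t), \quad G(t, \xi(t)) = A(t, \xi(t)) + B(t, \xi(t)) S(t, \xi(t)), \tag{3.5}$$

for which we seek the value of the quadratic functional

$$V = \left\langle \int_0^\infty X^T(t) H(t, \xi(t)) X(t)\,dt \right\rangle. \tag{3.6}$$

Assume that if there is a jump of the random process $\xi(t)$ at time $t_j$, then
the solution of (25) also has a jump

$$X(t_j + 0) = T_{sk} X(t_j - 0), \tag{3.7}$$

if $\xi(t_j + 0) = \theta_s$, $\xi(t_j - 0) = \theta_k$.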

For calculating the functional $V$, we utilize formula (12):

$$V = \sum_{k=1}^n C_k \circ D_k(0) = \sum_{k=1}^n \int_{E_m} V_k(X) f_k(0, X)\,dX, \tag{3.8}$$

where $V_k(X) = X^T C_k X$ are partial stochastic Lyapunov functions

$$V_k(X) = X^T C_k X = \int_0^\infty \left\langle X^T(t) H(t, \xi(t)) X(t) \mid X(0) = X,\ \xi(0) = \theta_k \right\rangle dt \quad (k = 1, \dots, n). \tag{3.9}$$

We can now use the expression for $C_k$ obtained in equation (18):

$$C_k = \int_0^\infty \psi_k(t) N_k^T(t) \left( Q_k(t) + S_k^T(t) L_k(t) S_k(t) \right) N_k(t)\,dt + \sum_{s=1}^n \int_0^\infty q_{sk}(t) N_k^T(t) T_{sk}^T C_s T_{sk} N_k(t)\,dt \quad (k = 1, \dots, n). \tag{3.10}$$

Here $N_k(t)$ are fundamental-solution matrices for the system of linear
differential equations

$$\frac{dX_k(t)}{dt} = G_k(t) X_k(t), \quad G_k(t) = A_k(t) + B_k(t) S_k(t), \quad X_k(t) = N_k(t) X \quad (k = 1, \dots, n). \tag{3.11}$$

Next we find an expression for the partial stochastic Lyapunov functions

$$V_k(X) = X^T C_k X = \int_0^\infty \left( X_k^T(t) \left( \psi_k(t) Q_k(t) + \sum_{s=1}^n q_{sk}(t) T_{sk}^T C_s T_{sk} \right) X_k(t) + U_k^T(t) \psi_k(t) L_k(t) U_k(t) \right) dt,$$

$$X_k(0) = X \quad (k = 1, \dots, n). \tag{3.12}$$

The system of equations (31) can be written as

$$\frac{dX_k(t)}{dt} = A_k(t) X_k(t) + B_k(t) U_k(t), \quad U_k(t) = S_k(t) X_k(t). \tag{3.13}$$

Suppose that there exists an optimal control (in the form (23)) for the
system (21) that minimizes the functional (22) and does not depend on the initial
value $X(0)$. We seek values for the symmetric matrices $C_k$ $(k = 1, \dots, n)$
which minimize the functional $V$. The problem of finding minimum values of
$V_k(X)$ $(k = 1, \dots, n)$ by choosing controls $U_k(t)$ has been thoroughly
investigated; see, for example, [Arocena, 1990] and [Arocena, 1990A]. For our
purposes, it is important that all matrices $C_k$ $(k = 1, \dots, n)$ in formula (32)
are constants.

Thus, the problem of obtaining optimal control (23) for a continuous control 
system with semi-Markov coefficients is reduced to n independent problems 
of obtaining optimal control for deterministic systems (33) with minimized 
functionals (32). 

We now apply some well-known results on finding optimal control for the
system of equations

$$\frac{dX(t)}{dt} = A(t) X(t) + B(t) U(t), \quad X(0) = X, \tag{3.14}$$

where we seek an optimal control $U(t)$ which minimizes the quadratic
functional

$$X^T C X = \int_0^\infty \left( X^T(t) Q(t) X(t) + U^T(t) L(t) U(t) \right) dt. \tag{3.15}$$

Optimal control $U(t)$ is defined by the formula

$$U(t) = -L^{-1}(t) B^T(t) Y(t), \quad Y(t) = K(t) X(t),$$

where the matrix $K(t)$ satisfies the following matrix differential Riccati
equation:

$$\frac{dK(t)}{dt} = -Q(t) - A^T(t) K(t) - K(t) A(t) + K(t) B(t) L^{-1}(t) B^T(t) K(t), \quad K(\infty) = 0. \tag{3.16}$$

Methods for solving equation (36) are described, for example, in [Arocena,
1990].
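
As a rough numerical illustration of the matrix Riccati equation (3.16), the sketch below integrates it backward in time with a classical Runge-Kutta step. It rests on two assumptions that are not in the text: the coefficient matrices are constant, and the condition $K(\infty) = 0$ is replaced by $K(T) = 0$ at a large finite horizon $T$. The function names are illustrative only.

```python
import numpy as np


def riccati_rhs(K, A, B, Q, L_inv):
    # Right-hand side of (3.16): dK/dt = -Q - A^T K - K A + K B L^{-1} B^T K
    return -Q - A.T @ K - K @ A + K @ B @ L_inv @ B.T @ K


def solve_riccati_backward(A, B, Q, L, horizon=20.0, steps=4000):
    """Approximate K(0) for constant A, B, Q, L with K(horizon) = 0.

    Uses RK4 with a negative time step (from t = horizon down to t = 0).
    The finite horizon is an assumed truncation of K(infinity) = 0.
    """
    L_inv = np.linalg.inv(L)
    h = horizon / steps
    K = np.zeros_like(Q, dtype=float)  # terminal condition
    for _ in range(steps):
        k1 = riccati_rhs(K, A, B, Q, L_inv)
        k2 = riccati_rhs(K - 0.5 * h * k1, A, B, Q, L_inv)
        k3 = riccati_rhs(K - 0.5 * h * k2, A, B, Q, L_inv)
        k4 = riccati_rhs(K - h * k3, A, B, Q, L_inv)
        K = K - (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return K


def optimal_gain(K, B, L):
    # Feedback matrix in U(t) = -L^{-1} B^T K X(t)
    return -np.linalg.solve(L, B.T @ K)
```

For the scalar system with $A = -1$, $B = Q = L = 1$, the stationary value solves $K^2 + 2K - 1 = 0$, so the computed $K(0)$ should be close to $\sqrt{2} - 1$.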

In view of these known results, we obtain the following expression for the
optimal control $U_k(t)$ which minimizes the functional $V_k(X)$ for the system of
equations (32):

$$U_k(t) = -\psi_k^{-1}(t) L_k^{-1}(t) B_k^T(t) K_k(t) X_k(t) \quad (k = 1, \dots, n), \tag{3.17}$$

where matrices $K_k(t)$ $(k = 1, \dots, n)$ satisfy the following Riccati-type system
of equations:

$$\frac{dK_k(t)}{dt} = -\psi_k(t) Q_k(t) - \sum_{s=1}^n q_{sk}(t) T_{sk}^T C_s T_{sk} - A_k^T(t) K_k(t) - K_k(t) A_k(t) + K_k(t) B_k(t) \psi_k^{-1}(t) L_k^{-1}(t) B_k^T(t) K_k(t),$$

$$K_k(\infty) = 0 \quad (k = 1, \dots, n). \tag{3.18}$$

The systems of equations (37)-(38) define necessary optimality conditions
for solutions of the system of equations (21). Matrices $S_k(t)$ $(k = 1, \dots, n)$
defining the optimal control (23) are defined by the matrix equations

$$S_k(t) = -\psi_k^{-1}(t) L_k^{-1}(t) B_k^T(t) K_k(t) \quad (k = 1, \dots, n),$$

and matrices $C_k$ are defined by the equalities

$$C_k = K_k(0) \quad (k = 1, \dots, n).$$

We solve each equation of the system (38) as a parameter equation with
parameter matrices $K_k(0)$ $(k = 1, \dots, n)$, utilizing numerical methods
developed for systems of type (36).

Thus, we obtain a system of $n$ matrix equations with $n$ unknown matrices
$K_k(0)$ $(k = 1, \dots, n)$, which enables us to find the values of $K_k(0)$
$(k = 1, \dots, n)$, and then the values of $K_k(t)$, $t > 0$.

Now introduce new matrices

$$R_k(t) = \psi_k^{-1}(t) K_k(t), \quad \psi_k(0) = 1 \quad (k = 1, \dots, n). \tag{3.19}$$

The system of equations (38) then takes the form

$$\frac{dR_k(t)}{dt} = -Q_k(t) - A_k^T(t) R_k(t) - R_k(t) A_k(t) + R_k(t) B_k(t) L_k^{-1}(t) B_k^T(t) R_k(t) - \frac{1}{\psi_k(t)} \left( \psi_k'(t) R_k(t) + \sum_{s=1}^n q_{sk}(t) T_{sk}^T R_s(0) T_{sk} \right),$$

$$R_k(\infty) = 0 \quad (k = 1, \dots, n), \tag{3.20}$$

and optimal control is given by the formulas

$$U_k(t) = -L_k^{-1}(t) B_k^T(t) R_k(t) X_k(t) \quad (k = 1, \dots, n). \tag{3.21}$$

The necessary optimality conditions (40) and (41) generalize previously ob- 
tained optimality conditions for control systems with coefficients dependent 
on a Markov random process. 

The following particular case is important in many applications. Suppose
that the semi-Markov process $\xi(t)$ cannot remain in any state $\theta_s$ for a time
period greater than $T_s > 0$ $(s = 1, \dots, n)$. Assume that

$$q_{ks}(t) = 0 \ (t > T_s), \quad \psi_s(t) = 0 \ (t > T_s). \tag{3.22}$$

We obtain the system of equations

$$C_k = \int_0^{T_k} \psi_k(t) N_k^T(t) \left( Q_k(t) + S_k^T(t) L_k(t) S_k(t) \right) N_k(t)\,dt + \sum_{s=1}^n \int_0^{T_k} q_{sk}(t) N_k^T(t) T_{sk}^T C_s T_{sk} N_k(t)\,dt \quad (k = 1, \dots, n),$$

and also the system of equations for the functions $V_k(X)$:

$$V_k(X) = \int_0^{T_k} \left( X_k^T(t) \left( \psi_k(t) Q_k(t) + \sum_{s=1}^n q_{sk}(t) T_{sk}^T C_s T_{sk} \right) X_k(t) + U_k^T(t) \psi_k(t) L_k(t) U_k(t) \right) dt, \quad X(0) = X \quad (k = 1, \dots, n). \tag{3.23}$$

In the system (38), we assume that

$$K_s(T_s) = 0 \quad (s = 1, \dots, n). \tag{3.24}$$

Since $\psi_s(T_s) = 0$ $(s = 1, \dots, n)$, conditions (44) will be satisfied if the
matrices $R_s(T_s)$ are bounded, in view of (39).

In order to find matrices $R_s(t)$ $(s = 1, \dots, n)$, we have to integrate the
system of nonlinear matrix differential equations (40). Each equation
belonging to the system can have a singular point $t = T_s$, where $\psi_s(T_s) = 0$
$(s = 1, \dots, n)$. We can obtain necessary conditions for boundedness of
matrices $R_s(t)$ at singular points $t = T_s$:

$$\psi_s'(T_s) R_s(T_s) + \sum_{k=1}^n q_{ks}(T_s) T_{ks}^T R_k(0) T_{ks} = 0 \quad (s = 1, \dots, n). \tag{3.25}$$

We formulate these results as a theorem. 

THEOREM 2 Assume that the optimal control $U(t)$ in the form (23) for
a control system (21) exists. Then the optimal control $U(t)$ that minimizes
the quadratic functional (22) is defined by the system (41), where matrices
$R_s(t)$ $(s = 1, \dots, n)$ satisfy the Riccati-type system of nonlinear differential
equations (40).

9.4 Optimization of discrete control systems with 
semi-Markov coefficients 

We consider the discrete control system

$$X_{k+1} = A(k, \xi_k) X_k + B(k, \xi_k) U_k \quad (k = 0, 1, 2, \dots) \tag{4.1}$$

with semi-Markov coefficients. We seek the control vector $U_k$ which
minimizes the quadratic functional

$$V = \left\langle \sum_{k=0}^\infty \left( X_k^T Q(k, \xi_k) X_k + U_k^T L(k, \xi_k) U_k \right) \right\rangle, \tag{4.2}$$

where $Q(k, \xi_k)$, $L(k, \xi_k)$ are symmetric positive definite matrices. Let
$k_j$ $(j = 0, 1, 2, \dots)$, $k_0 = 0$, be jump times of a semi-Markov process which
takes a finite number of distinct values $\theta_1, \dots, \theta_n$. Assume that at
$k \in [k_j, k_{j+1})$, $\xi_k = \theta_s$, the matrix coefficients in system (46) and in
formula (47) are defined by the following expressions:

$$A(k, \xi_k) = A_s(k - k_j), \quad B(k, \xi_k) = B_s(k - k_j),$$

$$Q(k, \xi_k) = Q_s(k - k_j), \quad L(k, \xi_k) = L_s(k - k_j) \quad (s = 1, \dots, n), \tag{4.3}$$

where $A_s(k), B_s(k), Q_s(k), L_s(k)$ are deterministic matrices.
Assume that the optimal control has the form

$$U_k = S(k, \xi_k) X_k \quad (k = 0, 1, 2, \dots), \tag{4.4}$$

where $S(k, \xi_k)$ is a matrix with semi-Markov coefficients, and that at
$k \in [k_j, k_{j+1})$, $\xi_k = \theta_s$ we have the equalities

$$S(k, \xi_k) = S_s(k - k_j) \quad (s = 1, \dots, n). \tag{4.5}$$

Now we introduce the matrices

$$G(k, \xi_k) = A(k, \xi_k) + B(k, \xi_k) S(k, \xi_k),$$

$$H(k, \xi_k) = Q(k, \xi_k) + S^T(k, \xi_k) L(k, \xi_k) S(k, \xi_k). \tag{4.6}$$

We obtain the system of linear difference equations

$$X_{k+1} = G(k, \xi_k) X_k \quad (k = 0, 1, 2, \dots), \tag{4.7}$$

with the minimized quadratic functional

$$V = \left\langle \sum_{k=0}^\infty X_k^T H(k, \xi_k) X_k \right\rangle. \tag{4.8}$$

Next, we introduce partial stochastic Lyapunov functions

$$V_s(X) = X^T C_s X = \sum_{k=0}^\infty \left\langle X_k^T H(k, \xi_k) X_k \mid X_0 = X,\ \xi_0 = \theta_s \right\rangle \quad (s = 1, \dots, n). \tag{4.9}$$

If the functions $V_s(X)$ $(s = 1, \dots, n)$ are calculated, the value of $V$ in (53)
can be obtained by the formula

$$V = \sum_{s=1}^n \int_{E_m} X^T C_s X\, f_s(0, X)\,dX = \sum_{s=1}^n C_s \circ D_s(0). \tag{4.10}$$

Now, consider the system of linear difference equations (52). Assume that
the solution of this system is multiplied from the left by constant matrices
$T_{s\ell}$, $\det T_{s\ell} \ne 0$ $(s, \ell = 1, \dots, n)$, at times when the random process
$\xi_k$ has jumps from state $\theta_\ell$ to state $\theta_s$.

Let $k \in [k_j, k_{j+1})$, $\xi_k = \theta_\ell$. The system of equations (52) takes the form

$$X_{k+1} = G_\ell(k - k_j) X_k \quad (\ell = 1, \dots, n),$$

where

$$G_\ell(k) = A_\ell(k) + B_\ell(k) S_\ell(k) \quad (\ell = 1, \dots, n).$$

Let systems of linear difference equations

$$X_{k+1}^{(s)} = G_s(k) X_k^{(s)} \quad (k = 0, 1, 2, \dots;\ s = 1, \dots, n)$$

have fundamental-solution matrices $N_s(k)$ $(k = 0, 1, \dots;\ s = 1, \dots, n)$,
which implies that

$$X_k^{(s)} = N_s(k) X_0^{(s)} \quad (k = 0, 1, 2, \dots;\ s = 1, \dots, n). \tag{4.11}$$

Assume that if the conditions

$$\xi_k = \theta_\ell \ \text{at}\ k \in [k_{j-1}, k_j), \quad \xi_k = \theta_s \ \text{at}\ k \in [k_j, k_{j+1})$$

are satisfied, then the following equalities hold:

$$X_k = N_\ell(k - k_{j-1}) X_{k_{j-1}} \quad (k_{j-1} \le k < k_j),$$

$$X_{k_j} = T_{s\ell} N_\ell(k_j - k_{j-1}) X_{k_{j-1}}, \quad \det T_{s\ell} \ne 0,$$

$$X_k = N_s(k - k_j) X_{k_j} \quad (k_j \le k < k_{j+1}), \tag{4.12}$$

i.e., at jump times, the solution of (52) is multiplied by a nonsingular matrix
$T_{s\ell}$. The system of equalities

$$V_s(X) = X^T C_s X = \sum_{k=0}^\infty \left( (X_k^{(s)})^T \left( \psi_s(k) Q_s(k) + \sum_{\ell=1}^n q_{\ell s}(k) T_{\ell s}^T C_\ell T_{\ell s} \right) X_k^{(s)} + (U_k^{(s)})^T \psi_s(k) L_s(k) U_k^{(s)} \right) \quad (s = 1, \dots, n) \tag{4.13}$$

is analogous to the system (32) and can be derived in a similar way.
Assuming that

$$H_s(k) = Q_s(k) + S_s^T(k) L_s(k) S_s(k) \quad (s = 1, \dots, n),$$

$$U_k^{(s)} = S_s(k) X_k^{(s)} \quad (s = 1, \dots, n),$$

we can rewrite equalities (58) as

$$V_s(X) = X^T C_s X = \sum_{k=0}^\infty X^T N_s^T(k) \left( \psi_s(k) H_s(k) + \sum_{\ell=1}^n q_{\ell s}(k) T_{\ell s}^T C_\ell T_{\ell s} \right) N_s(k) X \quad (s = 1, \dots, n). \tag{4.14}$$

Minimization of the functional $V$ in (53) is reduced to the minimization of
the functions $V_s(X)$ in (54). Thus, the problem of finding optimal control is
reduced to $n$ problems of optimizing the deterministic control systems

$$X_{k+1}^{(s)} = A_s(k) X_k^{(s)} + B_s(k) U_k^{(s)} \quad (s = 1, \dots, n), \tag{4.15}$$

where the optimal control $U_k^{(s)}$ minimizes the quadratic functional $V_s(X)$.

Now, we state a well-known result on optimizing systems of linear
difference equations with variable coefficients

$$X_{k+1} = A(k) X_k + B(k) U_k \quad (k = 0, 1, 2, \dots), \tag{4.16}$$

where the optimal control $U_k$ minimizes the quadratic functional

$$V = \sum_{k=0}^\infty \left( X_k^T Q(k) X_k + U_k^T L(k) U_k \right). \tag{4.17}$$

If an optimal control exists, it is defined by the formula

$$U_k = -L^{-1}(k) B^T(k) \left( E + K(k+1) B(k) L^{-1}(k) B^T(k) \right)^{-1} K(k+1) A(k) X_k = -\left( L(k) + B^T(k) K(k+1) B(k) \right)^{-1} B^T(k) K(k+1) A(k) X_k, \tag{4.18}$$

where the matrices $K(k)$ $(k = 0, 1, 2, \dots)$ satisfy the system of equations

$$K(k) = Q(k) + A^T(k) K(k+1) A(k) - A^T(k) K(k+1) B(k) \left( L(k) + B^T(k) K(k+1) B(k) \right)^{-1} B^T(k) K(k+1) A(k) \quad (k = 0, 1, 2, \dots). \tag{4.19}$$
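
The recursion (4.19) and the two equivalent forms of the control in (4.18) can be checked numerically. The sketch below assumes constant coefficient matrices and starts the backward recursion from $K = 0$ at a finite horizon; both are simplifying assumptions, and the function names are illustrative rather than taken from the text. The matrix $E$ is taken to be the identity.

```python
import numpy as np


def gain_first_form(A, B, L, K_next):
    # First expression in (4.18); E is the identity matrix.
    E = np.eye(A.shape[0])
    L_inv = np.linalg.inv(L)
    inner = np.linalg.inv(E + K_next @ B @ L_inv @ B.T)
    return -L_inv @ B.T @ inner @ K_next @ A


def gain_second_form(A, B, L, K_next):
    # Second expression in (4.18): -(L + B^T K B)^{-1} B^T K A
    return -np.linalg.solve(L + B.T @ K_next @ B, B.T @ K_next @ A)


def riccati_step(A, B, Q, L, K_next):
    # One backward step of (4.19): computes K(k) from K(k+1).
    BtKA = B.T @ K_next @ A
    correction = A.T @ K_next @ B @ np.linalg.solve(L + B.T @ K_next @ B, BtKA)
    return Q + A.T @ K_next @ A - correction


def backward_recursion(A, B, Q, L, steps=500):
    # Assumed finite-horizon start K = 0, constant coefficients.
    K = np.zeros_like(Q, dtype=float)
    for _ in range(steps):
        K = riccati_step(A, B, Q, L, K)
    return K
```

In the scalar case $A = B = Q = L = 1$ the fixed point of (4.19) satisfies $K^2 - K - 1 = 0$, so the recursion should approach the golden ratio $(1 + \sqrt{5})/2$.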

Next, we find an optimal control for the system of linear difference
equations (46) with minimized functional (47) by finding $U_k^{(s)}$ which minimize the
functionals (59) for the systems of difference equations (60). We obtain the
following formulas:

$$U_k^{(s)} = -\left( L_s(k) \psi_s(k) + B_s^T(k) K_s(k+1) B_s(k) \right)^{-1} B_s^T(k) K_s(k+1) A_s(k) X_k^{(s)} \quad (k = 0, 1, 2, \dots;\ s = 1, \dots, n), \tag{4.20}$$

$$K_s(k) = \psi_s(k) Q_s(k) + \sum_{\ell=1}^n q_{\ell s}(k) T_{\ell s}^T C_\ell T_{\ell s} + A_s^T(k) K_s(k+1) A_s(k) - A_s^T(k) K_s(k+1) B_s(k) \left( L_s(k) \psi_s(k) + B_s^T(k) K_s(k+1) B_s(k) \right)^{-1} B_s^T(k) K_s(k+1) A_s(k)$$

$$(k = 0, 1, 2, \dots;\ s = 1, \dots, n). \tag{4.21}$$

These equations can be simplified by setting

$$P_s(k-1) = \frac{1}{\psi_s(k-1)} K_s(k) \quad (k = 0, 1, 2, \dots;\ s = 1, \dots, n), \tag{4.22}$$

$$U_k^{(s)} = S_s(k) X_k^{(s)} \quad (k = 0, 1, 2, \dots;\ s = 1, \dots, n), \tag{4.23}$$

where $\psi_s(-1)$ is defined to equal 1.
We thus obtain the following system of matrix equations:

$$S_s(k) = -\left( L_s(k) + B_s^T(k) P_s(k) B_s(k) \right)^{-1} B_s^T(k) P_s(k) A_s(k) \quad (k = 0, 1, 2, \dots;\ s = 1, \dots, n); \tag{4.24}$$

$$\frac{\psi_s(k-1)}{\psi_s(k)} P_s(k-1) = Q_s(k) + \sum_{\ell=1}^n \frac{q_{\ell s}(k)}{\psi_s(k)} T_{\ell s}^T C_\ell T_{\ell s} + A_s^T(k) P_s(k) A_s(k) - A_s^T(k) P_s(k) B_s(k) \left( L_s(k) + B_s^T(k) P_s(k) B_s(k) \right)^{-1} B_s^T(k) P_s(k) A_s(k)$$

$$(k = 1, 2, \dots;\ s = 1, \dots, n), \tag{4.25}$$

which define necessary optimality conditions for solutions of the system (46).
The system of equations (66) contains unknown matrices $C_\ell$ $(\ell = 1, \dots, n)$.
Now, we utilize the known auxiliary formula

$$X_k^T K(k) X_k = \sum_{j=k}^\infty \left( X_j^T Q(j) X_j + U_j^T L(j) U_j \right) \quad (k = 0, 1, 2, \dots) \tag{4.26}$$

for the control system (61), where $X_j$, $U_j$ $(j = k, k+1, k+2, \dots)$ are
optimal solutions and optimal control which minimize the functional (62). From
this formula, it follows that the matrices $K(k)$ are symmetric and positive
semi-definite. From equality (71) and formulas (59), it follows that

$$C_s = K_s(0) \quad (s = 1, \dots, n). \tag{4.27}$$

We formulate the obtained result in a theorem. 

THEOREM 3 Assume that the optimal control in the form (49)

$$U_k = S(k, \xi_k) X_k \quad (k = 0, 1, 2, \dots)$$

for a control system (46)

$$X_{k+1} = A(k, \xi_k) X_k + B(k, \xi_k) U_k \quad (k = 0, 1, 2, \dots)$$

exists. Then the optimal control is defined by the system (69), where
matrices $P_s(k)$ $(s = 1, \dots, n;\ k = 0, 1, 2, \dots)$ satisfy the Riccati-type system of
difference equations (70).

Abstract A general approach, based on covering by cells, induced by Euclidean graphs, is
developed to provide asymptotic characterizations of multivariate sample
densities. This approach provides high dimensional analogs of basic results for
random partitions based on one-dimensional sample spacings. The methods used
in the proofs yield asymptotics for empirical $\phi$-divergences based on $k$-spacings
and also for the total edge length of the graphs involved.



10.1 Introduction and background 

Statistics in the form of $\phi$-divergences are used for several purposes
including, among others, goodness-of-fit tests and parametric estimation. The
Pearson $\chi^2$ is a well known statistic of this type. Such statistics are in general
designed for discrete or one dimensional continuous data. Although $\chi^2$ and
related methods can be used for continuous multivariate data, they are virtually
useless in high dimensions. How to deal with empirical $\phi$-divergences when the
observations are continuous and multivariate has been a long-standing need.
Basically, the difficulty is to define suitable analogues in $\mathbb{R}^d$ of spacings on
the line. In this work, we use random Euclidean graphs as adaptive schemes to
define statistical distances of continuous samples in $\mathbb{R}^d$, $d \ge 1$. Formally, we
prove strong laws for empirical $\phi$-divergences based on multidimensional
spacings induced by Euclidean graphs. In particular, these laws extend some
basic results of sample



° Research supported in part by NSA grant MDA904-01-1-0029 




spacing theory on the line. While this work is related to [Jimenez, 2002], the
approach taken here is considerably simpler and more general; it relies heavily
on the objective method developed in [Aizenman, 1982; Ahmed, 2000] and
more recently [Penrose, 2002A]. The methods also yield strong laws for the
empirical $\phi$-divergence for $k$-spacings.

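10.1.1. φ-divergence statistics for discrete data

Given a strictly convex function $\phi : \mathbb{R}_+ \to \mathbb{R}$, the [Csiszar, 1978]
$\phi$-divergence between two nonnegative $n$-dimensional vectors
$p := (p_1, \dots, p_n)$ and $q := (q_1, \dots, q_n)$ is

$$I_\phi(p, q) := \sum_{i=1}^n q_i\, \phi\!\left( \frac{p_i}{q_i} \right).$$

As in [Csiszar, 1967], we interpret undefined expressions by

$$\phi(0) = \lim_{t \to 0+} \phi(t), \quad 0\, \phi\!\left( \frac{0}{0} \right) = 0, \quad 0\, \phi\!\left( \frac{p}{0} \right) = \lim_{\epsilon \to 0+} \epsilon\, \phi\!\left( \frac{p}{\epsilon} \right) = p \lim_{t \to \infty} \frac{\phi(t)}{t}. \tag{1.1}$$

Assuming that $\phi$ is normalized (that is, $\phi(1) = 0$), and that
$\sum_{i=1}^n p_i = \sum_{i=1}^n q_i$, then Jensen's inequality implies

$$I_\phi(p, q) \ge 0 \quad \text{and} \quad I_\phi(p, q) = 0 \ \text{iff}\ p = q. \tag{1.2}$$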

These are properties of a distance. However, $I_\phi$ is not a distance: the triangle
inequality does not hold and $I_\phi$ is not symmetric, i.e., in general
$I_\phi(p, q) \ne I_\phi(q, p)$.

If we additionally assume that $\phi$ is nonnegative, then (1.2) holds even if
$\sum_{i=1}^n p_i \ne \sum_{i=1}^n q_i$. On the other hand, for any strictly convex and
normalized $\phi$, the function $\phi^*$ defined by

$$\phi^*(x) := \phi(x) - (x - 1) \lim_{h \to 0+} \frac{\phi(1 + h) - \phi(1 - h)}{2h}$$

is strictly convex, normalized, and nonnegative.
Moreover, if $\sum_{i=1}^n p_i = \sum_{i=1}^n q_i$ then

$$I_{\phi^*}(p, q) = I_\phi(p, q).$$

Thus we can and will assume without loss of generality that $\phi$ is strictly convex,
normalized, nonnegative, and that (1.2) holds whether $\sum_{i=1}^n p_i = \sum_{i=1}^n q_i$ or
not.

Property (1.2) makes $\phi$-divergence useful in various fields. [Csiszar, 1978]
reviews how $\phi$-divergence can be used in statistics. Roughly speaking, if $\hat p$
are observed frequencies then $I_\phi(\hat p, p)$ or $I_\phi(p, \hat p)$ can be used as a loss function
in statistical inference. Frequently used $\phi$-divergences in statistics involve the
power-divergence family $\Psi := \{\psi_\beta\}$ introduced by [Cressie, 1984],

$$\psi_\beta(x) := \frac{x^\beta - \beta x + \beta - 1}{\beta(\beta - 1)}, \quad \beta \ne 0, 1.$$

For example, when $\beta \to 0$ then $\psi_\beta(x) \to \psi_0(x) = -\log x + x - 1$. Thus

$$I_{\psi_0}(p, \hat p) = \sum_{i=1}^n \hat p_i \log\!\left( \frac{\hat p_i}{p_i} \right) \quad \text{and} \quad I_{\psi_0}(\hat p, p) = \sum_{i=1}^n p_i \log\!\left( \frac{p_i}{\hat p_i} \right)$$

are the log-likelihood ratio and the Kullback-Leibler divergence respectively.
When $\beta = 1/2$, $\psi_{1/2}(x) = 2(\sqrt{x} - 1)^2$ and

$$I_{\psi_{1/2}}(\hat p, p) = I_{\psi_{1/2}}(p, \hat p) = 2 \sum_{i=1}^n \left( \sqrt{\hat p_i} - \sqrt{p_i} \right)^2$$

is the Hellinger distance. When $\beta = 2$, $\psi_2(x) = (x - 1)^2 / 2$ and the statistics

$$I_{\psi_2}(p, \hat p) = \sum_{i=1}^n \frac{(\hat p_i - p_i)^2}{2 \hat p_i} \quad \text{and} \quad I_{\psi_2}(\hat p, p) = \sum_{i=1}^n \frac{(\hat p_i - p_i)^2}{2 p_i}$$

yield the $\chi^2$ statistics of Neyman and Pearson respectively. The statistics
$I_\phi(\hat p, p)$ and $I_\phi(p, \hat p)$ are among the more important cases of statistical
distances and have been used for several purposes including, among others,
goodness-of-fit tests of discrete data ([Cressie, 1984]) and parametric
estimation ([Lindsay, 1994]).

For any strictly convex, normalized, and nonnegative function $\phi(x)$ defined
on $(0, \infty)$, its adjoint function $\phi^\circ(x) = x \phi(1/x)$ is also strictly convex,
normalized, and nonnegative. In particular, if $\psi_\beta \in \Psi$ then $\psi_\beta^\circ = \psi_{1-\beta}$. Since
$I_{\phi^\circ}(\hat p, p) = I_\phi(p, \hat p)$, without loss of generality we will only consider the
statistical distance $I_\phi(\hat p, p)$ and we will omit in the sequel its adjoint statistical
distance $I_\phi(p, \hat p)$. See [Jimenez, 2001] for some aspects related to divergence
statistics and their adjoints.
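
The special cases quoted above can be verified numerically. The sketch below assumes the divergence has the form $I_\phi(p, q) = \sum_i q_i \phi(p_i / q_i)$ and that the power-divergence family is parametrized as $\psi_\beta(x) = (x^\beta - \beta x + \beta - 1) / (\beta(\beta - 1))$; this parametrization is an assumption chosen because it reproduces the stated cases $\beta \to 0$, $\beta = 1/2$ and $\beta = 2$. All entries are assumed strictly positive, so convention (1.1) is not needed. Function names are illustrative.

```python
import numpy as np


def psi(beta):
    """Assumed power-divergence generator psi_beta (limits taken at 0 and 1)."""
    if beta == 0:
        return lambda x: -np.log(x) + x - 1.0
    if beta == 1:
        return lambda x: x * np.log(x) - x + 1.0
    return lambda x: (x**beta - beta * x + beta - 1.0) / (beta * (beta - 1.0))


def phi_divergence(phi, p, q):
    """I_phi(p, q) = sum_i q_i * phi(p_i / q_i) for strictly positive vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * phi(p / q)))
```

For instance, with observed frequencies `p_hat` and model probabilities `p`, `phi_divergence(psi(2), p_hat, p)` gives half of Pearson's $\chi^2$ sum, and `phi_divergence(psi(0.5), p_hat, p)` is symmetric in its two arguments.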

10.1.2. Empirical φ-divergences based on spacings

The use of empirical $\phi$-divergences with one dimensional continuous data
is related to spacing theory as follows. Consider the order statistics
$X_1, X_2, \dots, X_n$ of $n$ independent random variables with common
distribution $F$. Let $X_0 = -\infty$. Then, the empirical estimate of the one dimensional
transformed spacing $F(X_i) - F(X_{i-1})$ is $1/n$. Thus,

$$I_\phi(F, n) := I_\phi\left( \{F(X_i) - F(X_{i-1}),\ 1 \le i \le n\}, \{1/n\} \right) := \frac{1}{n} \sum_{i=1}^n \phi\left( n \left( F(X_i) - F(X_{i-1}) \right) \right) \tag{1.3}$$

can be viewed as a statistical distance between the sample distribution $F$ and the
empirical distribution. The importance of statistical distances based on
spacings dates from the classic paper of [Pyke, 1965]. When $F$ is unknown, the
statistic $I_\phi(G, n)$ has been used to test the hypothesis $H_0 : G = F$. [Darling,
1953] provided the first systematic study of this statistic. If we assume that $F$
is in some family of distributions $\mathcal{G}$, then $F$ can be estimated by minimizing
$I_\phi(G, n)$ over $G \in \mathcal{G}$. A remarkable case is given by $\psi_0(x) = -\log x + x - 1$,
which corresponds to the maximum product of spacing method, introduced
by [Cheng, 1983] and later by [Ranneby, 1984]. The strong consistency of
the maximum product of spacing method and the strong consistency of the
goodness-of-fit test based on $I_{\psi_0}(F, n)$ rely on the following strong law,
proved by [Shao, 1995] under mild conditions on $G$ and $F$,

$$\lim_{n \to \infty} I_{\psi_0}(G, n) = \int E\left[ \psi_0\left( e\, \frac{g(x)}{f(x)} \right) \right] dF(x) \quad \text{a.s.} \tag{1.4}$$

Here and elsewhere $e$ is an exponential random variable with mean one.
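
The spacing statistic (1.3) is straightforward to compute. The following is a minimal sketch, assuming the hypothesized distribution function is supplied as a vectorized callable `cdf` and that the sample has no ties; the function names are illustrative and not from the text.

```python
import numpy as np


def psi0(t):
    # psi_0(t) = -log t + t - 1, the generator tied to maximum product of spacings
    return -np.log(t) + t - 1.0


def spacing_divergence(sample, cdf, phi=psi0):
    """Compute (1/n) * sum_i phi(n * (F(X_i) - F(X_{i-1}))) as in (1.3).

    The sample is sorted to obtain order statistics, and X_0 = -infinity
    contributes F(X_0) = 0.
    """
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    u = np.concatenate(([0.0], cdf(x)))  # F(X_0) = 0 prepended
    spacings = np.diff(u)                # F(X_i) - F(X_{i-1}), i = 1..n
    return float(np.mean(phi(n * spacings)))
```

When the transformed spacings all equal $1/n$ the statistic vanishes, since $\phi(1) = 0$; for a genuine uniform sample tested against the uniform distribution function it stays strictly positive.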

The main result of [Hoist, 1979] implies, for general $\phi$, the asymptotic
normality of the empirical $\phi$-divergence based on $k$-spacings

i=k 

under the hypothesis $H_0 : G = F$. This includes, for the particular case
$k = 1$, the asymptotic normality of $I_\phi(F, n)$. Also the asymptotic normality
of $S_\phi^k(G_n, n)$ has been studied for special sequences of alternatives $G_n$ such
that $G_n \to F$ when $n \to +\infty$; see [Hall, 1986] and its references. Under
stringent regularity conditions on $G$, $F$, and $\phi$, [Hoist, 1981] proved a central
limit theorem for $I_\phi(G, n)$. However the asymptotic normality of $S_\phi^k(G, n)$ for
fixed $G \ne F$ has been an open problem dating from the 1950's [Pyke, 1965].

10.2 The nearest neighbor φ-divergence and main results

We show in this work that random Euclidean graphs with a
locally defined structure provide a natural scheme for generalizing one
dimensional results based on spacings. We will first consider a scheme based on
nearest neighbors.

For every sample point $X_i$ consider the cell $C_i := C_i(X_1, \dots, X_n)$ centered
at $X_i$ with radius equal to the distance to the nearest neighbor in the sample
$\{X_1, \dots, X_n\}$. We will use these cells to define a high dimensional spacing
statistic analogous to the classical one-dimensional statistic. The cell $C_i$ is of
course a ball, but we prefer to call it a cell, since this anticipates more general
spacing statistics described in the sequel. An attractive feature of these
spacings is a monotonicity property identical to that for the classic one dimensional
spacings: the cell around a given point decreases in volume as the number of
points increases.

Throughout, $X_1, X_2, \dots$ are independent random variables in $\mathbb{R}^d$ with
common probability density $f$, and $g$ is an arbitrary probability density function.

DEFINITION 1 For each $n > 1$, we define for $1 \le i \le n$ the sample spacings

$$D_{i,n} := D_i(X_1, \dots, X_n) := \int_{C_i(X_1, \dots, X_n)} dx \tag{2.1}$$

and the transformed spacings

$$D^g_{i,n} := D^g_i(X_1, \dots, X_n) := \int_{C_i(X_1, \dots, X_n)} g(x)\,dx. \tag{2.2}$$

For all $1 \le i \le n$ we have $D^{g_1}_{i,n+k}(x_1, \dots, x_n, y_1, \dots, y_k) \le D^{g_2}_{i,n}(x_1, \dots, x_n)$
for any functions $0 \le g_1 \le g_2$, since adding points can only shrink the cell
around $x_i$. We will measure the discrepancy between $g$ and
the sample density $f$ by comparing the transformed spacings
$\{D^g_{i,n},\ 1 \le i \le n\}$ with $\{D^f_{i,n},\ 1 \le i \le n\}$.

We will use

$$N_\phi\left( \{D^g_{i,n}\}, \{D^f_{i,n}\} \right) := \sum_{i=1}^n D^f_{i,n}\, \phi\!\left( \frac{D^g_{i,n}}{D^f_{i,n}} \right) \tag{2.3}$$

as a measure of the "distance" between $g$ and $f$; we term this the "nearest
neighbors $\phi$-divergence". It is a discrete version, induced by the balls of the
nearest neighbors graph, of the [Csiszar, 1967] $\phi$-divergence between $g$ and $f$ on
B, namely

$$\int_B f(x)\, \phi\!\left( \frac{g(x)}{f(x)} \right) dx. \tag{2.4}$$

If $f$ is unknown, we can replace $D^f_{i,n}$ in (2.3) by its empirical estimate $1/n$.
In this manner, we obtain the following statistic, which we call the "empirical
nearest neighbor $\phi$-divergence", and which forms one of our central objects of
interest:

$$N^g_\phi := N^g_\phi(X_1, \dots, X_n) := N_\phi\left( \{D^g_{i,n}\}, \{1/n\} \right) = \frac{1}{n} \sum_{i=1}^n \phi\left( n \cdot D^g_{i,n} \right). \tag{2.5}$$

Our main purpose is to describe the a.s. behavior of $N^g_\phi(X_1, \dots, X_n)$. We first
introduce some notation.

DEFINITION 2 Let $\Phi$ be the class of all normalized and strictly convex
functions $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ such that there exists $\gamma > 0$ such that for all $\alpha > 0$ we
have $\int_0^\infty \phi^{4+\gamma}(\alpha t) \exp(-t)\,dt < \infty$.

It is easy to check that the frequently used $\phi$'s in statistics, including the
power divergence family, are in the class $\Phi$.

The following limit theorem, the main result of this section, establishes the
a.s. consistency of the empirical nearest neighbor $\phi$-divergence. We let $A$
denote the support of $f$.

THEOREM 1 Let $X_1, X_2, \dots$ be independent random variables with a density
$f$ and let $g$ be a continuous density. If $f$ and $g$ are bounded away from zero
and infinity on $A$ and if $\phi \in \Phi$, then

$$\lim_{n \to \infty} N^g_\phi(X_1, \dots, X_n) = \int_A f(x)\, E\left[ \phi\left( e \cdot \frac{g(x)}{f(x)} \right) \right] dx \quad \text{a.s.} \tag{2.6}$$



The integral in (2.6) represents a divergence between $f$ and $g$, which by
Jensen's inequality and the identity $E[e] = 1$ is no smaller than the Csiszar
divergence (2.4). Thus a small empirical nearest neighbors $\phi$-divergence implies
a small Csiszar divergence.

If $g(x) = f(x)$ a.e., then the right hand side of (2.6) equals $E[\phi(e)]$. On the
other hand, if $g(x) \ne f(x)$ on some subset with positive Lebesgue measure, a
combined application of Fubini's theorem and Jensen's inequality gives

$$\int_A f(x)\, E\left[ \phi\left( e \cdot \frac{g(x)}{f(x)} \right) \right] dx = E\left[ \int_A f(x)\, \phi\left( e \cdot \frac{g(x)}{f(x)} \right) dx \right] > E[\phi(e)].$$

Thus, using this notation we obtain the following corollary.

COROLLARY 1 Under the same conditions of Theorem 1,

$$\lim_{n \to \infty} N^g_\phi(X_1, \dots, X_n) \ge E[\phi(e)] \quad \text{a.s.} \tag{2.7}$$

Moreover, there is strict inequality in (2.7) except for the case $g(x) = f(x)$ a.e.

In dimension $d = 1$, Theorem 1 is closely related to the empirical
$\phi$-divergence for $k$-spacings. The next theorem extends (1.4) to the context of
$k$-spacings and general $\phi$.

THEOREM 2 Let $X_1, X_2, \dots$ be independent real valued random variables
with common density $f$. Let $g$ be a continuous density and $G(x) = \int_{-\infty}^x g(u)\,du$.
Let $\Gamma(k, 1)$ be a gamma random variable with parameters $k$ and 1. If $f$ and $g$
are bounded away from zero and infinity on $A$ and if $\phi \in \Phi$, then

$$\lim_{n \to \infty} S^k_\phi(G, n) = \int_A f(x)\, E\left[ \phi\left( \Gamma(k, 1)\, \frac{g(x)}{f(x)} \right) \right] dx \quad \text{a.s.} \tag{2.8}$$

Remark 2.1 It is a simple consequence of the uniform integrability of the
left-hand sides of (2.6) and (2.8) that the limits there also hold in $L^1$.
Remark 2.2 [Bickel, 1983] develop central limit theorems for statistics based
on nearest neighbor distances. They consider the special case
$\phi(x) = \exp(-x)$ and use the approximation

$$\int_{C_i(X_1, \dots, X_n)} f(x)\,dx \approx f(X_i)\, |C_i(X_1, \dots, X_n)|$$

and confine attention to sums $\sum_{i=1}^n \exp(-n f(X_i) |C_i|)$, where here and
elsewhere $|C|$ denotes the volume of a set $C$. The strong consistency established
by Theorem 1 can be viewed as an initial step in extending [Bickel, 1983]
to more general $\phi$. From the standpoint of goodness of fit tests, it would be
desirable to supplement Theorem 1 with a central limit theorem for the
empirical nearest neighbors $\phi$-divergence and to provide an explicit formula for the
limiting variance.

Remark 2.3. (a Shannon entropy estimate) The proof of Theorem 1 describes
the large sample behavior of the sum-function of nearest neighbor spacings

$$\frac{1}{n} \sum_{i=1}^n \phi\left( n \cdot D_{i,n} \right).$$

These statistics provide estimates for entropy-type functionals of the sample
density. To fix this idea consider $\phi(t) := \psi_0(t) = -\log t + t - 1$ and $g = 1$.
An elementary computation involving Theorem 1 and convention (1.1) implies

$$\frac{1}{n} \sum_{i=1}^n \log\left( n \cdot D_{i,n} \right) \to H(f) + E \log(e) \quad \text{a.s.},$$

where $H(f) := -\int f(x) \log f(x)\,dx$ is the well-known Shannon entropy.
Estimates of $H(f)$ are of general interest; see [Dudewicz, 1987] for a review of
the one-dimensional case. They can be used in the context of the maximum
entropy method which has wide applications in several fields.
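
The entropy estimate of Remark 2.3 can be sketched as follows. The code computes the nearest neighbor spacings $D_{i,n}$ as volumes of balls in $\mathbb{R}^d$ and subtracts $E \log(e) = -\gamma$, where $\gamma$ is the Euler-Mascheroni constant (a standard fact for a mean-one exponential variable). The brute-force pairwise distance computation and the function names are illustrative choices, not part of the text.

```python
import math

import numpy as np

EULER_GAMMA = 0.5772156649015329  # E[log e] = -EULER_GAMMA for e ~ Exp(1)


def nn_spacings(points):
    """Volumes D_{i,n} of the nearest neighbor balls around each sample point."""
    pts = np.asarray(points, dtype=float)
    n, d = pts.shape
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt(np.sum(diff**2, axis=-1))
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbor
    radius = dist.min(axis=1)
    unit_ball = math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)
    return unit_ball * radius**d


def entropy_estimate(points):
    """Estimate H(f) by (1/n) * sum_i log(n * D_{i,n}) - E[log e]."""
    D = nn_spacings(points)
    n = D.size
    return float(np.mean(np.log(n * D)) + EULER_GAMMA)
```

For a sample from the uniform density on the unit square, whose Shannon entropy is zero, the estimate should be close to zero for moderately large $n$; the memory cost of this simple version grows like $n^2$.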
Remark 2.4. (equivalence with maximum likelihood) Suppose that the sample
density $f(x)$ belongs to the parametric family $\mathcal{F} := \{f_\theta(x) : \theta \in \Theta\}$. Let $\theta_0$
be such that $f(x) = f_{\theta_0}(x)$. The maximum likelihood (ML) estimate of $\theta_0$ is
obtained by maximizing the log-likelihood function

$$l_n(\theta) := \sum_{i=1}^n \log f_\theta(X_i).$$

Let $K(f, g)$ denote the Kullback-Leibler relative entropy, that is

$$K(f, g) := \int f(x) \log\!\left( \frac{f(x)}{g(x)} \right) dx.$$

By the strong law of large numbers, $K(f_{\theta_0}, f_\theta) < \infty$ implies

$$\frac{1}{n} \left( l_n(\theta_0) - l_n(\theta) \right) \to K(f_{\theta_0}, f_\theta) \quad \text{a.s.}$$

On the other hand, if $f_\theta$ is bounded away from zero and infinity on the support
of $f_{\theta_0}$, then by Theorem 1 we have

$$\frac{1}{n} \sum_{i=1}^n \log\left( n \cdot D^{f_\theta}_{i,n} \right) \to -K(f_{\theta_0}, f_\theta) + E[\log(e)] \quad \text{a.s.} \tag{2.9}$$

Thus, under general conditions, maximizing the log-likelihood function is
asymptotically equivalent to maximizing the left-hand side of (2.9). We will call

$$\hat\theta_\phi := \arg\min_{\theta \in \Theta} \sum_{i=1}^n \phi\left( n \cdot D^{f_\theta}_{i,n} \right) \tag{2.10}$$

the minimum nearest neighbors $\phi$-divergence (M$\phi$D) estimate. Roughly
speaking, the ML estimate and $\hat\theta_{-\log}$ are asymptotically equivalent.
Remark 2.5. (multivariate version of maximum spacing method) Under
general conditions, the ML estimate can have optimal asymptotic properties and
thus $\hat\theta_{-\log}$ must have the same type of asymptotic properties. However, when
the likelihood function is unbounded, the ML estimate can be inconsistent.
The M$\phi$D method is a multivariate version of the maximum product of spacing
(MPS) method, which is an alternative to the ML method when the likelihood
function is unbounded. Since the sum of the logarithm of spacings is always
upper bounded, even in the cases where the ML method fails, the MPS method
can generate asymptotically optimal estimates. This feature can be observed
for example in many mixture models, which are not necessarily restricted to
the one dimensional case. Similarly to the one dimensional case, the
empirical nearest neighbors $\phi$-divergence is always lower bounded. Thus, the M$\phi$D
method can generate consistent estimates even when the ML method fails.
Remark 2.6. (consistency of M$\phi$D estimates) For $\phi(t) := -\log t$, Theorem
1 resembles the asymptotics (1.4) for the logarithm sum of one-dimensional
spacings obtained by [Shao, 1999]. Information-type inequalities such as
Corollary 1 play a key role in proving strong consistency of the MPS method ([Shao,
1999]) and related one-dimensional methods. In the same way, our results can
be applied to prove strong consistency of the estimate $\hat\theta_\phi$ defined in (2.10). For
example, Corollary 1 implies that the estimate $\hat\theta_\phi$ is always consistent for any
$\phi \in \Phi$, if $\Theta$ is finite. General consistency theorems may be obtained assuming
regularity conditions on $\mathcal{F}$.

10.3 Statistical distances based on Voronoi cells 

Theorem 1 shows the efficacy of using random graphs based on nearest 
neighbor distances to define statistical distances which generalize consistency 
results for one dimensional spacings to higher dimensions. Nearest neighbor 
graphs are easy to generate but in some cases it may be advantageous to con- 
sider statistical distances using other graphs which have a strong locally de- 
fined structure. We illustrate the possibilities by considering graphs involving 
Voronoi tessellations. 

Voronoi tessellations generated by random sets of points are of general
interest and have been used in many diverse fields ([Aurenhammer, 1991], [Obake,
1992], [Moller, 1994]). Much like nearest neighbors graphs, Voronoi
tessellations may be used as an adaptive scheme to compare probability densities
on $\mathbb{R}^d$, $d \ge 1$.

Given a set of points $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$ and a Borel subset $B$ of $\mathbb{R}^d$,
consider for any $x_i \in X \cap B$ the locus of points closer to $x_i$ than to any other
point of $X \cap B$. The intersection of this set of points with $B$ is a Voronoi cell
and is denoted by $V_i(B) := V_i(B; x_1, \ldots, x_n)$, that is

$$V_i(B; x_1, \ldots, x_n) := \{y \in B : \|y - x_i\| \le \|y - x_j\|, \ \forall x_j \in X \cap B\},$$

where $\|\cdot\|$ denotes the Euclidean distance. If $x_i \notin B$ then we define $V_i(B) = \emptyset$.
Thus, $\{V_i(B), 1 \le i \le n\}$ is a partition of $B$ which is called the Voronoi
tessellation of $B$ generated by $X$ and is denoted by $V(B; X)$. It is understood
that if $X \cap B = \emptyset$ then $V(B; X) = B$. Also, if $X_1, X_2, \ldots$ are i.i.d. with a
density whose support is $A$, then we reserve the notation $V_i(X_1, \ldots, X_n)$ for
$V_i(A; X_1, \ldots, X_n)$. Figure 1 shows the Voronoi tessellation generated by a
uniform random sample on the unit square.

We may use the Voronoi cells to define high dimensional sample spacings 
as follows. 

DEFINITION 3 For each $n \ge 1$, we define for $1 \le i \le n$ the sample spacings

$$D_{i,n} := D_i(X_1, \ldots, X_n) := \int_{V_i(X_1, \ldots, X_n)} dx \qquad (3.1)$$

and the transformed spacings

$$D^g_{i,n} := D^g_i(X_1, \ldots, X_n) := \int_{V_i(X_1, \ldots, X_n)} g(x)\,dx. \qquad (3.2)$$

Figure 1. Voronoi tessellation of the unit square.



Exactly as in the context of the nearest neighbors graph, we will measure the
discrepancy between $g$ and the sample density $f$ by comparing the transformed
spacings $\{D^g_{i,n},\ 1 \le i \le n\}$ with $\{D^f_{i,n},\ 1 \le i \le n\}$.

We thus obtain the following statistic, which we call the "empirical Voronoi
$\phi$-divergence", and which forms the natural analog of the empirical nearest
neighbors $\phi$-divergence (2.5):

$$V^\phi_g(X_1, \ldots, X_n) := \frac{1}{n}\sum_{i=1}^{n} \phi\big(n \cdot D^g_{i,n}\big). \qquad (3.3)$$
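The construction of the transformed spacings and of the statistic (3.3) can be illustrated numerically. The following is only a rough sketch under several assumptions that are not in the text: the region is the unit square, the cell integrals are approximated on a grid of midpoints rather than computed exactly, the five sites are fixed by hand instead of being sampled, and $\phi(t) = -\log t$ is used as one admissible choice. The function names are hypothetical.

```python
import math

def voronoi_cell_masses(sites, g, m=100):
    """Approximate D^g_i, the integral of g over the Voronoi cell of sites[i]
    inside the unit square, by assigning each midpoint of an m x m grid to
    its nearest site (a crude quadrature, not an exact tessellation)."""
    masses = [0.0] * len(sites)
    weight = 1.0 / (m * m)  # area represented by one grid point
    for a in range(m):
        for b in range(m):
            y = ((a + 0.5) / m, (b + 0.5) / m)
            nearest = min(
                range(len(sites)),
                key=lambda i: (y[0] - sites[i][0]) ** 2 + (y[1] - sites[i][1]) ** 2,
            )
            masses[nearest] += g(y) * weight
    return masses

def empirical_voronoi_divergence(sites, g, phi, m=100):
    """(1/n) * sum_i phi(n * D^g_i), the form of the statistic in (3.3)."""
    n = len(sites)
    return sum(phi(n * d) for d in voronoi_cell_masses(sites, g, m)) / n

# Hand-picked illustrative sites (an assumption; the text uses random samples).
sites = [(0.2, 0.2), (0.8, 0.2), (0.5, 0.5), (0.2, 0.8), (0.8, 0.8)]
uniform_density = lambda y: 1.0          # g = uniform density on the unit square
neg_log = lambda t: -math.log(t)         # phi(t) = -log t

masses = voronoi_cell_masses(sites, uniform_density)
statistic = empirical_voronoi_divergence(sites, uniform_density, neg_log)
```

Because the approximate cell masses sum to one and $-\log$ is convex, Jensen's inequality makes this statistic nonnegative, which gives a quick sanity check on the quadrature.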

The following main result is the Voronoi analog of Theorem 1. Let $\mathcal{P}_1$
denote a homogeneous Poisson point process of constant intensity 1 on $\mathbb{R}^d$, let
$0$ denote the origin of $\mathbb{R}^d$, and let $\epsilon_V$ denote the volume of the Voronoi cell
around $0$ in the Voronoi tessellation on $\mathcal{P}_1 \cup 0$. While Theorem 3 is similar to
Theorem 1.1 of [Jimenez, 2002], which assumes continuity of $f$, the method
of proof is much easier and follows the relatively simple proof of Theorem 1.

THEOREM 3 Let $X_1, X_2, \ldots$ be independent random variables with a density
$f$ and let $g$ be a continuous density. If $f$ and $g$ are bounded away from zero
and infinity on $A$ and if $\phi \in \Phi$, then

$$\lim_{n \to \infty} V^\phi_g(X_1, \ldots, X_n) = \int_A f(x)\, E\left[\phi\left(\epsilon_V\, \frac{g(x)}{f(x)}\right)\right] dx \quad \text{a.s.} \qquad (3.4)$$



Other Euclidean graphs may also be used as adaptive schemes to compare
probability densities. For this, we must define the cells around the sample
points according to the geometric characteristics of the considered graph.
Thus, empirical $\phi$-divergences can be defined analogously to (3.3) and in
general they satisfy a.s. asymptotics of the form (3.4), with $\epsilon_V$ replaced by the
volume of the related cell around the origin induced by the graph on $\mathcal{P}_1 \cup 0$.

10.4 The objective method 

Theorem 1 is anticipated by Theorems 2.2 and 2.4 of [Penrose, 2002A], 
which uses the objective method to establish a weak law of large numbers 
for stabilizing functionals of random variables. Similarly Theorem 3 is an- 
ticipated by Theorem 2.5 of [Penrose, 2002A]. However, neither Theorem 1 
nor Theorem 3 is a consequence of [Penrose, 2002A] since neither the nearest 
neighbors nor Voronoi statistic is translation invariant (translating the sample
points changes the statistic according to the density $g$). Thus one needs to
modify existing methods in order to establish Theorems 1 and 3. In the first 
part of this section we prove Theorem 1. Completely similar methods may be 
used to prove Theorem 3. 

Let $A$ denote the support of $f$ and for all $\xi > 0$, let $\mathcal{P}_\xi$ denote a Poisson point
process with intensity measure $\xi f : A \to \mathbb{R}$. To prove Theorem 1, we start by
showing that a Poissonized version of (2.6) holds in expectation, namely we
show that if we only assume $E[\phi(a\epsilon)] < \infty$ for all $a > 0$, then

$$\lim_{\xi \to \infty} \int_A E\left[\phi\left(\xi \int_{C(x, \mathcal{P}_\xi)} g(u)\,du\right)\right] f(x)\,dx = \int_A E\left[\phi\left(\epsilon\, \frac{g(x)}{f(x)}\right)\right] f(x)\,dx. \qquad (4.1)$$

The proof of (4.1) may be established using lengthy and somewhat cumbersome
methods, as in [Jimenez, 2002], which actually requires continuity of $f$.
It is more instructive and much easier to use the following key lemma, which 
further illustrates the power of the objective method [Aizenman, 1982; Ahmed, 
2000]. 

To set the stage, we note that for fixed $x$ and for large $\xi$, the volume of
the cell $C(x, \mathcal{P}_\xi)$, when multiplied by $\xi$, is roughly the same as the volume of
the cell $C(x, \mathcal{P}_{f(x)})$. Here and elsewhere $\mathcal{P}_\tau$ denotes a homogeneous Poisson
point process on $\mathbb{R}^d$ with intensity $\tau$. Since for all $\tau$ we have
$\tau\,|C(x, \mathcal{P}_\tau)| \stackrel{\mathcal{D}}{=} |C(0, \mathcal{P}_1)| = \epsilon$, this suggests the following lemma, where here
and elsewhere, $\xrightarrow{P}$ denotes convergence in probability.

LEMMA 1 For almost all $x \in A$ we have as $\xi \to \infty$

$$\xi f(x)\,|C(x, \mathcal{P}_\xi)| \xrightarrow{P} \epsilon. \qquad (4.2)$$



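We defer the proof of Lemma 1 and show how to use it to deduce Theorem
1. By hypothesis we have positive finite constants $k_1$ and $k_2$ such that for all
$x \in A$

$$k_1 \le f(x) \le k_2 \quad \text{and} \quad k_1 \le g(x) \le k_2. \qquad (4.3)$$

Since $\epsilon\, g(x)/f(x)$ has the same distribution as $\xi \int_{C(x, \mathcal{P}_{\xi f(x)})} g(x)\,du$, we need only
to show for almost all $x \in A$ that

$$E\left|\phi\left(\xi \int_{C(x, \mathcal{P}_\xi)} g(u)\,du\right) - \phi\left(\xi \int_{C(x, \mathcal{P}_{\xi f(x)})} g(x)\,du\right)\right| \qquad (4.4)$$

tends to zero as $\xi \to \infty$. Letting $x' := x'(\xi)$ denote a point in the cell $C(x, \mathcal{P}_\xi)$ such that
$g(x')\,|C(x, \mathcal{P}_\xi)| = \int_{C(x, \mathcal{P}_\xi)} g(u)\,du$, we equivalently only need to show that

$$E\left|\phi\big(\xi\, g(x')\,|C(x, \mathcal{P}_\xi)|\big) - \phi\left(\epsilon\, \frac{g(x)}{f(x)}\right)\right| \qquad (4.5)$$

tends to zero as $\xi \to \infty$.

Now (4.5) is bounded by the sum of

$$E\left|\phi\left(\xi f(x)\, \frac{g(x')}{f(x)}\,|C(x, \mathcal{P}_\xi)|\right) - \phi\left(\epsilon\, \frac{g(x')}{f(x)}\right)\right| \qquad (4.6)$$

and

$$E\left|\phi\left(\epsilon\, \frac{g(x')}{f(x)}\right) - \phi\left(\epsilon\, \frac{g(x)}{f(x)}\right)\right|. \qquad (4.7)$$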



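Given a convex function $\phi \in \Phi$, let $\phi_1(x) := \phi(x)\mathbf{1}_{(0,1)}(x)$ be its decreasing
part and let $\phi_2(x) := \phi(x)\mathbf{1}_{[1,+\infty)}(x)$ be its increasing part. By Lemma 1 and the
continuity of $\phi$,

$$\phi\left(\xi f(x)\, \frac{g(x')}{f(x)}\,|C(x, \mathcal{P}_\xi)|\right) - \phi\left(\epsilon\, \frac{g(x')}{f(x)}\right)$$

tends to zero in probability. Since $\phi_2$ is increasing we have for all $\xi > 0$

$$E\left[\phi_2\left(\xi f(x)\, \frac{g(x')}{f(x)}\,|C(x, \mathcal{P}_\xi)|\right)\right] \le E\left[\phi_2\left(\frac{k_2}{k_1}\,\epsilon\right)\right]$$

and thus the assumed integrability of $\phi(a\epsilon)$, $a > 0$, shows that

$$\phi_2\left(\xi f(x)\, \frac{g(x')}{f(x)}\,|C(x, \mathcal{P}_\xi)|\right) - \phi_2\left(\epsilon\, \frac{g(x')}{f(x)}\right), \quad \xi > 0,$$

are uniformly integrable. Thus ([Dudley, 1989], Thm 10.3.6)

$$E\left|\phi_2\left(\xi f(x)\, \frac{g(x')}{f(x)}\,|C(x, \mathcal{P}_\xi)|\right) - \phi_2\left(\epsilon\, \frac{g(x')}{f(x)}\right)\right| \to 0. \qquad (4.8)$$

Similarly, since $\phi_1$ is decreasing

$$E\left[\phi_1\left(\xi f(x)\, \frac{g(x')}{f(x)}\,|C(x, \mathcal{P}_\xi)|\right)\right] \le E\left[\phi_1\left(\frac{k_1}{k_2}\,\epsilon\right)\right],$$

showing that

$$\phi_1\left(\xi f(x)\, \frac{g(x')}{f(x)}\,|C(x, \mathcal{P}_\xi)|\right) - \phi_1\left(\epsilon\, \frac{g(x')}{f(x)}\right), \quad \xi > 0,$$

are also uniformly integrable. Thus splitting $\phi$ as $\phi_1 + \phi_2$, we see that the
difference (4.6) tends to zero as $\xi \to \infty$. Similarly, by the continuity of $g$ we
have as $\xi \to \infty$

$$\frac{g(x')}{f(x)} \xrightarrow{P} \frac{g(x)}{f(x)}$$

and together with uniform integrability arguments, this shows that the
difference (4.7) also tends to zero as $\xi \to \infty$. Thus (4.4) tends to zero as desired.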

Since the cell $C := C(x, \mathcal{P}_\xi)$ is a nearest neighbors cell, it depends only
locally on the surrounding points and this localization, together with the
moment condition $\int_0^\infty \phi^{1+\gamma}(at)\exp(-t)\,dt < \infty$, makes it straightforward to
de-Poissonize the mean limit (4.1). This can be accomplished by following
verbatim Lemma 2.5 of [Jimenez, 2002].

Since the density $f$ is assumed bounded away from zero and infinity and
since the volume of the nearest neighbor cell around $x$ with high probability
depends on sample points distant $C(\log n / n)^{1/d}$ from $x$, we may follow
the proof of Lemma 3.1 of [Jimenez, 2002] and use isoperimetric arguments
to establish that the difference of our de-Poissonized statistic with its mean,
namely $\left|\frac{1}{n}\sum_{i=1}^n \phi(n \cdot D^g_{i,n}) - \frac{1}{n}\sum_{i=1}^n E\,\phi(n \cdot D^g_{i,n})\right|$, is almost surely of order
$o(1)$, showing that convergence of the mean is equivalent to a.s. convergence.
We leave these details to the reader.



It only remains to prove Lemma 1. For $r > 0$, let $B_r(x)$ denote the
Euclidean ball $\{y \in \mathbb{R}^d : |y - x| \le r\}$ of radius $r$ centered at $x$.




Proof of Lemma 1. Given $\tau > 0$, recall that $\mathcal{P}_\tau$ denotes a homogeneous
Poisson point process on $\mathbb{R}^d$ with intensity $\tau$. For all $z \in \mathbb{R}^d$, let $C(z, \mathcal{P}_\tau)$
denote the nearest neighbors cell around $z$ with respect to $\mathcal{P}_\tau$. Note that for all
$\tau > 0$ we have $\tau\,|C(0, \mathcal{P}_\tau)| \stackrel{\mathcal{D}}{=} |C(0, \mathcal{P}_1)| = \epsilon$, since the volume of the nearest
neighbors cell around the origin is a mean one exponential random variable.
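The exponential limit behind Lemma 1 can be checked by simulation. The sketch below makes assumptions that go beyond the text: the nearest neighbors cell around $x$ is taken to be the ball centred at $x$ whose radius is the distance to the nearest sample point, the density $f$ is uniform on the unit square (so $f(x) = 1$), and a fixed sample size $n$ stands in for the Poisson intensity $\xi$. The function name is hypothetical.

```python
import math
import random

def scaled_nn_cell_volume(n, rng):
    """Return n * (area of the disc around x = (0.5, 0.5) reaching the nearest
    of n i.i.d. uniform points on the unit square). With f = 1 this plays the
    role of xi * f(x) * |C(x, P_xi)| in Lemma 1, under the assumptions above."""
    x = (0.5, 0.5)
    nearest_sq = min(
        (rng.random() - x[0]) ** 2 + (rng.random() - x[1]) ** 2
        for _ in range(n)
    )
    return n * math.pi * nearest_sq

rng = random.Random(0)                      # fixed seed for reproducibility
samples = [scaled_nn_cell_volume(500, rng) for _ in range(2000)]
sample_mean = sum(samples) / len(samples)   # should be close to 1
tail_fraction = sum(s > 1.0 for s in samples) / len(samples)  # close to exp(-1)
```

For a mean one exponential variable the mean is 1 and the probability of exceeding 1 is $e^{-1} \approx 0.37$, so both summary numbers give a rough check of the limit.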

$C(z, \mathcal{P}_\tau)$ is locally defined in the sense (section 6 of [Penrose, 2001]) that
there is a random variable $R := R(z, \tau)$ with exponentially decaying tails and
an a.s. finite random variable $C_\infty(z, \mathcal{P}_\tau)$ such that

$$C_\infty(z, \mathcal{P}_\tau) = C\big(z, (\mathcal{P}_\tau \cap B_R(z)) \cup \mathcal{A}\big)$$

for all locally finite $\mathcal{A}$ outside $B_R(z)$.

Given $\mathcal{P}_\xi$, the Poisson point process with intensity $\xi f : A \to \mathbb{R}^+$, for
all $x \in A$ let $\mathcal{P}_{\xi f(x)}$ be a homogeneous Poisson point process with constant
intensity $\xi f(x)$. We may assume that $\mathcal{P}_{\xi f(x)}$ is coupled to $\mathcal{P}_\xi$ in such a way
that for all Borel sets $B \subset A$ we have

$$P\big[\mathcal{P}_\xi \ne \mathcal{P}_{\xi f(x)} \text{ on } B\big] \le \xi \int_B |f(x) - f(y)|\,dy. \qquad (4.9)$$

Next, for any Lebesgue point $x$ for $f$, $x \in A$, for all $\xi, t \in \mathbb{R}^+$ consider the
event

$$E(x, \xi, t) := \big\{R(\xi^{1/d}x, \xi^{1/d}\mathcal{P}_{\xi f(x)}) \le t, \ \ \xi^{1/d}\mathcal{P}_\xi = \xi^{1/d}\mathcal{P}_{\xi f(x)} \text{ on } B_t(\xi^{1/d}x)\big\}.$$

By (4.9) we have that $P[E(x, \xi, t)^c]$ is bounded by

$$P\big[R(\xi^{1/d}x, \xi^{1/d}\mathcal{P}_{\xi f(x)}) > t\big] + \xi \int_{B_{t\xi^{-1/d}}(x)} |f(x) - f(y)|\,dy. \qquad (4.10)$$

Since $f$ is Lebesgue integrable and since $x$ is a Lebesgue point for $f$, the
integral in (4.10) tends to zero as $\xi \to \infty$. Since $R(\xi^{1/d}x, \xi^{1/d}\mathcal{P}_{\xi f(x)})$ has the
same distribution as $R(0, \mathcal{P}_{f(x)})$, which is finite a.s., it follows that if $t$ is large
enough, then the first term in (4.10) tends to zero as $\xi \to \infty$. Therefore, for all
$\delta > 0$ and for $\xi$ and $t$ large enough,

$$P[E(x, \xi, t)^c] < \delta.$$


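Now we can prove Lemma 1 as follows. Write $E := E(x, \xi, t)$ and
$R := R(\xi^{1/d}x, \xi^{1/d}\mathcal{P}_{\xi f(x)})$. We observe

$$\begin{aligned}
\xi f(x) \cdot |C(x, \mathcal{P}_\xi)|
&= \xi f(x) \cdot |C(x, \mathcal{P}_\xi)| \cdot \mathbf{1}_{E} + \xi f(x) \cdot |C(x, \mathcal{P}_\xi)| \cdot \mathbf{1}_{E^c} \\
&= f(x) \cdot |C(\xi^{1/d}x, \xi^{1/d}\mathcal{P}_\xi)| \cdot \mathbf{1}_{E} + \xi f(x) \cdot |C(x, \mathcal{P}_\xi)| \cdot \mathbf{1}_{E^c} \\
&= f(x) \cdot |C(\xi^{1/d}x, \xi^{1/d}\mathcal{P}_\xi \cap B_R(\xi^{1/d}x))| \cdot \mathbf{1}_{E} + \xi f(x) \cdot |C(x, \mathcal{P}_\xi)| \cdot \mathbf{1}_{E^c} \\
&= f(x) \cdot |C(\xi^{1/d}x, \xi^{1/d}\mathcal{P}_{\xi f(x)})| \cdot \mathbf{1}_{E} + \xi f(x) \cdot |C(x, \mathcal{P}_\xi)| \cdot \mathbf{1}_{E^c},
\end{aligned}$$

where the last equality holds since on the set $E(x, \xi, t)$ we have
$\xi^{1/d}\mathcal{P}_\xi = \xi^{1/d}\mathcal{P}_{\xi f(x)}$.

The above is equal in distribution to

$$f(x) \cdot |C(0, \mathcal{P}_{f(x)})| - f(x) \cdot |C(0, \mathcal{P}_{f(x)})| \cdot \mathbf{1}_{E^c} + \xi f(x) \cdot |C(x, \mathcal{P}_\xi)| \cdot \mathbf{1}_{E^c}. \qquad (4.11)$$

The first term is equal in distribution to $|C(0, \mathcal{P}_1)|$ and the last two terms in
(4.11) tend to zero in probability as $\xi \to \infty$ and $t \to \infty$. This follows from the
probability estimate $P[E(x, \xi, t)^c] < \delta$ as well as the bounds

$$E\big[(\xi\,|C(x, \mathcal{P}_\xi)|)^p\big] \le E\big[|C(x, \mathcal{P}_{k_1})|^p\big] \quad \text{and} \quad E\big[|C(x, \mathcal{P}_{k_1})|^p\big] < \infty$$

for all $x \in \mathbb{R}^d$ and all $p \ge 1$.

This completes the proof of Lemma 1. ■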

It only remains to give the proof of Theorem 2. Since the methods are very
similar, we only give a sketch.

Proof of Theorem 2. We will follow the proof of Theorem 1 closely. Let
$\mathcal{P}_\tau$ be a homogeneous Poisson point process on $\mathbb{R}$ with constant intensity $\tau$.
Let $\{X_i\}$ be the realization of $\mathcal{P}_\tau$ and let $X_{(i)}$ be the usual order statistics. For
any $x \in \mathbb{R}$, let $C_k(x, \mathcal{P}_\tau) = X_{(k)}(x) - x$ denote the length of the associated
$k$-spacing, where $X_{(k)}(x)$ is the $k$-th point in $\mathcal{P}_\tau$ to the right of $x$. The proof of
Theorem 2 depends upon the following lemma.

LEMMA 2 For almost all $x \in A$ we have as $\xi \to \infty$

$$\xi f(x)\,|C_k(x, \mathcal{P}_\xi)| \xrightarrow{P} \Gamma(k, 1). \qquad (4.12)$$

To prove this lemma, we simply follow the proof of Lemma 1 with $C_k(x, \mathcal{P}_\xi)$
replacing $C(x, \mathcal{P}_\xi)$ and note that for all $\tau > 0$ we have

$$\tau\,|C_k(x, \mathcal{P}_\tau)| \stackrel{\mathcal{D}}{=} |C_k(0, \mathcal{P}_1)| \stackrel{\mathcal{D}}{=} \Gamma(k, 1).$$

Now just follow the proof of Theorem 1. ■
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Abstract Given the price of a call or put option, the Black-Scholes implied volatility is 

the unique volatility parameter for which the Black-Scholes formula recovers the 
option price. This article surveys research activity relating to three theoretical 
questions: First, does implied volatility admit a probabilistic interpretation? Sec- 
ond, how does implied volatility behave as a function of strike and expiry? Here 
one seeks to characterize the shapes of the implied volatility skew (or smile) 
and term structure, which together constitute what can be termed the statics of 
the implied volatility surface. Third, how does implied volatility evolve as time 
rolls forward? Here one seeks to characterize the dynamics of implied volatility. 



11.1 Introduction 
11.1.1. Implied volatility 

Assuming that an underlying asset in a frictionless market follows geomet- 
ric Brownian motion, which has constant volatility, the Black-Scholes formula 
gives the no-arbitrage price of an option on that underlying. Inverting this 
formula, take as given the price of a call or put option. The Black-Scholes im- 
plied volatility is the unique volatility parameter for which the Black-Scholes 
formula recovers the price of that option. 

This article surveys research activity in the theory of implied volatility. In 
light of the compelling empirical evidence that volatility is not constant, it is 
natural to question why the inversion of option prices in an "incorrect" formula 
should deserve such attention. 

To answer this, it is helpful to regard the Black-Scholes implied volatility 
as a language in which to express an option price. Use of this language does 
not entail any belief that volatility is actually constant. A relevant analogy is 
the quotation of a discount bond price by giving its yield to maturity, which 
is the interest rate such that the observed bond price is recovered by the usual 
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constant interest rate bond pricing formula. In no way does the use or study 
of bond yields entail a belief that interest rates are actually constant. Just as
YTM is just an alternative way of expressing a bond price, so implied volatility
is just an alternative way of expressing an option price.

The language of implied volatility is, moreover, a useful alternative to raw 
prices. It gives a metric by which option prices can be compared across dif- 
ferent strikes, maturities, and underlyings, and by which market prices can be 
compared to assessments of fair value. It is a standard in industry, to the extent 
that traders quote option prices in "vol" points, and exchanges update implied 
volatility indices in real time.

Furthermore, to whatever extent implied volatility has a simple interpretation
as an average future volatility, it becomes not only useful, but also natural.
Indeed, understanding implied volatility as an average will be one of the focal 
points of this article. 

11.1.2. Outline 

Under one interpretation, implied volatility is the market's expectation of 
future volatility, time-averaged over the term of the option. In what sense does 
this interpretation admit mathematical justification? In section 2 we review 
the progress on this question, in two contexts: first, under the assumption that 
instantaneous volatility is a deterministic function of the underlying and time; 
and second, under the assumption that instantaneous volatility is stochastic in 
the sense that it depends on a second random factor. 

If instantaneous volatility is not constant, then implied volatilities will ex- 
hibit variation with respect to strike (described graphically as a smile or skew) 
and with respect to expiry (the term structure); the variation jointly in strike 
and expiry can be described graphically as a surface. In section 3, we review 
the work on characterizing or approximating the shape of this surface under 
various sets of assumptions. Assuming only absence of arbitrage, one finds 
bounds on the slope of the volatility surface, and characterizations of the tail 
growth of the volatility skew. Assuming stochastic volatility dynamics for the 
underlying, one finds perturbation approximations for the implied volatility 
surface, in any of a number of different regimes, including long maturity, short 
maturity, fast mean reversion, and slow mean reversion. 

Whereas sections 2 and 3 examine how implied volatility behaves under 
certain assumptions on the spot process, section 4 directly takes as primitive the 
implied volatility, with a view toward modelling accurately its time-evolution. 
We begin with the no-arbitrage approach to the direct modelling of stochastic 
implied volatility. Then we review the statistical approach. Whereas the focus
of section 3 is cross-sectional (taking a "snapshot" of all strikes and expiries),
hence the term statics, the focus of section 4 is instead time-series oriented,
hence the term dynamics.

11.1.3. Definitions 

Our underlying asset will be a non-dividend paying stock or index with
nonnegative price process $S_t$. Generalization to non-zero dividends is
straightforward.

A call option on $S$, with strike $K$ and expiry $T$, pays $(S_T - K)^+$ at time $T$.
The price of this option is a function $C$ of the contract variables $(K, T)$, today's
date $t$, the underlying $S_t$, and any other state variables in the economy. We will
suppress some or all of these arguments. Moreover, sections 2 and 3 will for
notational convenience assume $t = 0$ unless otherwise stated; but section 4,
in which the time-evolution of option prices becomes more important, will not
assume $t = 0$.

Let the risk-free interest rate be a constant $r$. Write

$$x := \log\left(\frac{K}{S_t e^{r(T-t)}}\right)$$

for log-moneyness of an option at time $t$. Note that both of the possible choices
of sign convention appear in the literature; we have chosen to define
log-moneyness to be such that $x$ has a positive relationship with $K$.

Assuming frictionless markets, Black and Scholes [Black & Scholes, 1973]
showed that if $S$ follows geometric Brownian motion

$$dS_t = \mu S_t\,dt + \sigma S_t\,dW_t$$

then the no-arbitrage call price satisfies

$$C = C^{BS}(\sigma),$$

where the Black-Scholes formula is defined by

$$C^{BS}(\sigma) := C^{BS}(S_t, t, K, T, \sigma) := S_t N(d_1) - K e^{-r(T-t)} N(d_2).$$

Here

$$d_{1,2} := \frac{\log(S_t e^{r(T-t)}/K)}{\sigma\sqrt{T-t}} \pm \frac{\sigma\sqrt{T-t}}{2},$$

and $N$ is the cumulative normal distribution function.

On the other hand, given $C(K, T)$, the implied [Black-Scholes] volatility
for strike $K$ and expiry $T$ is defined as the $I(K, T)$ that solves

$$C(K, T) = C^{BS}(K, T, I(K, T)).$$

The solution is unique because $C^{BS}$ is strictly increasing in $\sigma$, and as $\sigma \to 0$
(resp. $\infty$), the Black-Scholes function $C^{BS}(\sigma)$ approaches the lower (resp.
upper) no-arbitrage bounds on a call.
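Because $C^{BS}$ is strictly increasing in $\sigma$, the defining equation for $I(K, T)$ can be solved by any bracketing root finder. The following is a minimal sketch, assuming $t = 0$ so that the time to expiry is $T$, and using plain bisection; the function names and the search bracket $[10^{-8}, 10]$ for $\sigma$ are illustrative choices, not part of the text.

```python
import math

def norm_cdf(z):
    """Cumulative standard normal distribution function N."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bs_call(S, K, T, r, sigma):
    """Black-Scholes call price at t = 0 for expiry T."""
    vol = sigma * math.sqrt(T)
    d1 = math.log(S * math.exp(r * T) / K) / vol + 0.5 * vol
    d2 = d1 - vol
    return S * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)

def implied_vol(price, S, K, T, r, lo=1e-8, hi=10.0, tol=1e-10):
    """Solve bs_call(sigma) = price by bisection on the assumed bracket [lo, hi].
    Monotonicity of the call price in sigma makes the root unique."""
    lower_bound = max(S - K * math.exp(-r * T), 0.0)
    if not (lower_bound < price < S):
        raise ValueError("price lies outside the no-arbitrage bounds")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, T, r, mid) < price:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Round trip: price an option at sigma = 0.25, then recover that volatility.
price = bs_call(100.0, 110.0, 0.5, 0.03, 0.25)
iv = implied_vol(price, 100.0, 110.0, 0.5, 0.03)
```

Bisection is slow but cannot diverge; a Newton iteration on vega would converge faster at the cost of needing safeguards for strikes far from the money.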

Implied volatility can also be written as a function $\tilde{I}$ of log-moneyness and
time, so $\tilde{I}(x, T) := I(S_t e^{x + r(T-t)}, T)$. Abusing notation, we will drop the
tilde on $\tilde{I}$, because the context will make clear whether $I$ is to be viewed as a
function of $K$ or $x$.

The derivation of the Black-Scholes formula can proceed by means of a
hedging argument that yields a PDE to be solved for $C(S, t)$:

$$\frac{\partial C}{\partial t} + \frac{1}{2}\sigma^2 S^2 \frac{\partial^2 C}{\partial S^2} + rS\frac{\partial C}{\partial S} - rC = 0, \qquad (1.1)$$

with terminal condition $C(S, T) = (S - K)^+$. Alternatively, one can appeal
to martingale pricing theory, which guarantees that in the absence of arbitrage
(appropriately defined; see for example [Delbaen & Schachermayer, 1994]),
there exists a "risk-neutral" probability measure under which the discounted
prices of all tradeable assets are martingales. We assume such conditions, and
unless otherwise stated, our references to probabilities, distributions, and
expectations will be with respect to such a pricing measure, not the statistical
measure. In the constant-volatility case, changing from the statistical to the
pricing measure yields

$$dS_t = rS_t\,dt + \sigma S_t\,dW_t.$$

So $\log(S_T/S_t)$ is normal with mean $(r - \sigma^2/2)(T - t)$ and variance $\sigma^2(T - t)$, and
the Black-Scholes formula follows from $C = e^{-r(T-t)}E(S_T - K)^+$.

11.2 Probabilistic Interpretation 

In what sense is implied volatility an average expected volatility? Some 
econometric studies [Canina & Figlewski, 1993; Christensen & Prabhala, 1998] 
test whether or not implied volatility is an "unbiased" predictor of future volatil- 
ity, but they have limited relevance to our question, because they address the 
empirics of a far narrower question in which "expected" future volatility is 
with respect to the statistical probability measure. Our focus, instead, is the 
theoretical question of whether there exist natural definitions of "average" and 
"expected" such that implied volatility can indeed be understood - provably - 
as an average expected volatility. 
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11.2.1. Time-dependent volatility 

In the case of time-dependent but nonrandom volatility, a simple formula
exists for Black-Scholes implied volatility.
Suppose that

$$dS_t = rS_t\,dt + \sigma(t)S_t\,dW_t$$

where $\sigma$ is a deterministic function. Define

$$\bar{\sigma} := \left(\frac{1}{T}\int_0^T \sigma^2(u)\,du\right)^{1/2}.$$

Then one can show that $\log(S_T/S_0)$ is normal with mean $(r - \bar{\sigma}^2/2)T$ and variance
$\bar{\sigma}^2 T$, from which it follows that

$$C = C^{BS}(\bar{\sigma}),$$

and hence

$$I = \bar{\sigma}.$$

Thus implied volatility is equal to the quadratic mean volatility from 0 to $T$.
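As a small numerical illustration of the quadratic mean, the sketch below approximates $\bar{\sigma}$ by the midpoint rule for a hypothetical two-level volatility path (10% for the first half of the year, 30% for the second). The path and the function names are assumptions chosen only for the example.

```python
import math

def quadratic_mean_vol(sigma, T, steps=10_000):
    """Approximate ((1/T) * integral_0^T sigma(u)^2 du)^(1/2) by the midpoint rule."""
    dt = T / steps
    total = sum(sigma((i + 0.5) * dt) ** 2 for i in range(steps))
    return math.sqrt(total * dt / T)

def step_vol(u):
    """Hypothetical deterministic volatility: 10% before u = 0.5, 30% afterwards."""
    return 0.10 if u < 0.5 else 0.30

sigma_bar = quadratic_mean_vol(step_vol, T=1.0)
arithmetic_mean = 0.5 * 0.10 + 0.5 * 0.30
```

Here $\bar{\sigma} = \sqrt{0.05} \approx 0.2236$, which exceeds the arithmetic time average of 0.20: the implied volatility weights high-volatility periods more heavily than a simple average would.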

11.2.2. Time-and-spot-dependent Volatility 

Now assume that

$$dS_t = rS_t\,dt + \sigma(S_t, t)S_t\,dW_t \qquad (2.1)$$

where $\sigma$ is a deterministic function, usually called the local volatility. We
will also treat local volatility as a function $\tilde{\sigma}$ of time-0 moneyness $x$, via the
definition $\tilde{\sigma}(x, T) := \sigma(S_0 e^{x + rT}, T)$; but abusing notation, we will suppress
the tildes.

11.2.2.1 Local volatility and implied local volatility. Under local
volatility dynamics, call prices satisfy (1.1), but with variable coefficients:

$$\frac{\partial C}{\partial t} + \frac{1}{2}\sigma^2(S, t)S^2\frac{\partial^2 C}{\partial S^2} + rS\frac{\partial C}{\partial S} - rC = 0, \qquad (2.2)$$

and also with terminal condition $C(S, T) = (S - K)^+$.

Dupire [Dupire, 1994] showed that instead of fixing $(K, T)$ and obtaining
the backward PDE for $C(S, t)$, one can fix $(S, t)$ and obtain a forward PDE
for $C(K, T)$. A derivation (also in [Bouchouev & Isakov, 1997]) proceeds as
follows.

Differentiating (2.2) twice with respect to strike shows that $G := \partial^2 C/\partial K^2$
satisfies the same PDE, but with terminal data $\delta(S - K)$. Thus $G$ is the Green's
function of (2.2), and it is the transition density of $S$. By a standard result (in
[Friedman, 1964], for example), it follows that $G$ as a function of the variables
$(K, T)$ satisfies the adjoint equation, which is the Fokker-Planck PDE

$$\frac{\partial G}{\partial T} - \frac{1}{2}\frac{\partial^2}{\partial K^2}\big(\sigma^2(K, T)K^2 G\big) + r\frac{\partial}{\partial K}(KG) + rG = 0.$$

Integrating twice with respect to $K$ and applying the appropriate boundary
conditions, one obtains the Dupire equation:

$$\frac{\partial C}{\partial T} - \frac{1}{2}K^2\sigma^2(K, T)\frac{\partial^2 C}{\partial K^2} + rK\frac{\partial C}{\partial K} = 0, \qquad (2.3)$$

with initial condition $C(K, 0) = (S - K)^+$.

Given call prices at all strikes and maturities up to some horizon, define
implied local volatility as

$$L(K, T) := \left(\frac{\dfrac{\partial C}{\partial T} + rK\dfrac{\partial C}{\partial K}}{\dfrac{1}{2}K^2\dfrac{\partial^2 C}{\partial K^2}}\right)^{1/2}. \qquad (2.4)$$

According to (2.3), this is the local volatility function consistent with the given
prices of options. Define implied local variance as $L^2$.
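Definition (2.4) can be evaluated numerically from any smooth call-price function. The sketch below is an illustration only: it assumes $t = 0$, replaces the partial derivatives by central finite differences with hand-chosen step sizes, and tests the result on Black-Scholes prices with constant volatility, for which the implied local volatility should reproduce that constant. With noisy market quotes this direct differentiation would be unstable, and the function names are hypothetical.

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bs_call(S, K, T, r, sigma):
    """Black-Scholes call price at t = 0 for expiry T."""
    vol = sigma * math.sqrt(T)
    d1 = math.log(S * math.exp(r * T) / K) / vol + 0.5 * vol
    return S * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d1 - vol)

def implied_local_vol(call, K, T, r, dK=1e-2, dT=1e-4):
    """Evaluate (2.4) for a call-price function call(K, T) using central
    finite differences (step sizes dK, dT are illustrative assumptions)."""
    c_T = (call(K, T + dT) - call(K, T - dT)) / (2.0 * dT)
    c_K = (call(K + dK, T) - call(K - dK, T)) / (2.0 * dK)
    c_KK = (call(K + dK, T) - 2.0 * call(K, T) + call(K - dK, T)) / dK ** 2
    return math.sqrt((c_T + r * K * c_K) / (0.5 * K ** 2 * c_KK))

S0, r, sigma = 100.0, 0.03, 0.20
flat_surface = lambda K, T: bs_call(S0, K, T, r, sigma)
L = implied_local_vol(flat_surface, K=105.0, T=1.0, r=r)
```

The recovered value agrees with the input volatility of 0.20 up to discretization error, which is the consistency statement made after (2.4).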

Following standard terminology, our use of the term implied volatility will,
in the absence of other modifiers, refer to implied Black-Scholes volatility, not
implied local volatility. The two concepts are related as follows: Substituting

$$C = C^{BS}(I(S_0 e^{x + rT}, T)) \qquad (2.5)$$

into (2.4) yields

$$L^2(x, T) = \frac{2TI\dfrac{\partial I}{\partial T} + I^2}{\left(1 - \dfrac{x}{I}\dfrac{\partial I}{\partial x}\right)^2 + TI\dfrac{\partial^2 I}{\partial x^2} - \dfrac{1}{4}T^2I^2\left(\dfrac{\partial I}{\partial x}\right)^2}. \qquad (2.6)$$

See, for example, Andersen and Brotherton-Ratcliffe [Andersen &
Brotherton-Ratcliffe, 1998]. Whereas the computation of $I$ from market data poses no
numerical difficulties, the recovery of $L$ is an ill-posed problem that requires
careful treatment; see also [Avellaneda et al, 1997; Bouchouev & Isakov, 1997;
Coleman, Li & Verma, 1999; Gzyl & Villasana, 2003]. These issues will not
concern us here, because our use of implied local volatility $L$ will be strictly as
a theoretical device to link local volatility results to stochastic volatility results,
in section 11.2.3.1.

11.2.2.2 Short-dated implied volatility as harmonic mean local volatility.
In certain regimes, the representation of implied volatility as an average
expected volatility can be made precise. Specifically, Berestycki, Busca,
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and Florent ([Berestycki, Busca & Florent, 2002]; BBF henceforth) show that 
in the short-maturity limit, implied volatility is the harmonic mean of local 
volatility. 

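The PDE that relates implied volatility $I(x, T)$ to local volatility $\sigma(x, T)$ is,
by substituting (2.5) into (2.3),

$$2TI\frac{\partial I}{\partial T} + I^2 - \sigma^2(x, T)\left[\left(1 - \frac{x}{I}\frac{\partial I}{\partial x}\right)^2 + TI\frac{\partial^2 I}{\partial x^2} - \frac{1}{4}T^2I^2\left(\frac{\partial I}{\partial x}\right)^2\right] = 0.$$

Let $I_0(x)$ be the solution to the ODE generated by taking $T = 0$ in the PDE.
Thus

$$I_0^2 - \sigma^2(x, 0)\left(1 - x\frac{I_0'}{I_0}\right)^2 = 0.$$

Elementary calculations show that the ODE is solved by

$$I_0(x) = \left(\int_0^1 \frac{ds}{\sigma(sx, 0)}\right)^{-1}.$$

A natural conjecture is that the convergence $I_0 = \lim_{T \to 0} I(x, T)$ holds.
Indeed this is what Berestycki, Busca, and Florent [Berestycki, Busca & Florent,
2002] prove. Therefore, short-dated implied volatility is approximately the
harmonic mean of local volatility, where the mean is taken "spatially," along
the line segment on $T = 0$, from moneyness 0 to moneyness $x$.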
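The harmonic-mean formula for $I_0(x)$ is easy to evaluate numerically. The sketch below uses the midpoint rule and checks it against the closed form available for a hypothetical linear local volatility $\sigma(x, 0) = a + bx$, for which $\int_0^1 ds/(a + bxs) = \log(1 + bx/a)/(bx)$. The coefficients and function names are assumptions made for the example.

```python
import math

def short_dated_implied_vol(local_vol_at_0, x, steps=10_000):
    """I_0(x) = ( integral_0^1 ds / sigma(s*x, 0) )^(-1), by the midpoint rule."""
    ds = 1.0 / steps
    integral = sum(ds / local_vol_at_0((i + 0.5) * ds * x) for i in range(steps))
    return 1.0 / integral

a, b = 0.20, -0.10                      # hypothetical linear skew sigma(x, 0) = a + b*x
linear_local_vol = lambda y: a + b * y

x = 0.3
numeric = short_dated_implied_vol(linear_local_vol, x)
closed_form = b * x / math.log(1.0 + b * x / a)

# A flat local volatility must be returned unchanged by the harmonic mean.
flat = short_dated_implied_vol(lambda y: 0.25, 1.0)
```

In this example the local volatility falls from 0.20 at the money to 0.17 at $x = 0.3$, and the harmonic mean along that segment is roughly 0.185.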

The harmonic mean here stands in contrast to arithmetic or quadratic means
that have been proposed in the literature as rules of thumb. As BBF argue,
probabilistic considerations rule out the arithmetic and quadratic means; for
example, consider a local volatility diffusion in which there exists a price level
$H \in (S_0, K)$ above which the local volatility vanishes, but below which it
is positive. Then the option must have zero premium, hence zero implied
volatility. This is inconsistent with taking a spatial mean of $\sigma$ arithmetically or
quadratically, but is consistent with taking a spatial mean of $\sigma$ harmonically.

11.2.2.3 Deep in/out-of-the-money implied volatility as quadratic mean
local volatility. BBF also show that if local volatility is uniformly
continuous and bounded by constants so that

$$0 < \underline{\sigma} \le \sigma(x, T) \le \bar{\sigma},$$

and if local volatility has continuous limit(s)

$$\sigma_\pm(t) = \lim_{x \to \pm\infty} \sigma(x, t)$$

locally uniformly in $t$, then deep in/out-of-the-money implied volatility
approximates the quadratic mean of local volatility, in the following sense:

$$\lim_{x \to \pm\infty} I(x, T) = \left(\frac{1}{T}\int_0^T \sigma_\pm^2(s)\,ds\right)^{1/2}.$$

The idea of the proof is as follows. Considering by symmetry only the $x \to \infty$
limit, let $I_\infty(T) := \left(\frac{1}{T}\int_0^T \sigma_+^2(s)\,ds\right)^{1/2}$. Note that $I_\infty$ induces, via definition
(2.4), a local variance $L^2$ that has the correct behavior at $x = \infty$, because the
denominator is 1 while the numerator is $\sigma_+^2(T)$.

To turn this into a proof, BBF show that for any $\varepsilon$ one can construct a
function $\psi(x)$ such that $1 \le \psi(\infty) \le 1 + \varepsilon$ and such that $I_\infty(T)\psi(x)$ induces via
(2.4) a local volatility that dominates $L$. By a comparison result of BBF,

$$\limsup_{x \to \infty} I(x, T) \le I_\infty(T)\psi(\infty) \le (1 + \varepsilon)I_\infty(T).$$

On the other hand, one can construct $\psi$ such that

$$\liminf_{x \to \infty} I(x, T) \ge I_\infty(T)\psi(\infty) \ge (1 - \varepsilon)I_\infty(T).$$

Taking $\varepsilon$ to 0 yields the result.

11.2.3. Stochastic volatility 

Now suppose that

$$dS_t = rS_t\,dt + \sigma_t S_t\,dW_t,$$

where $\sigma_t$ is stochastic. In contrast to local volatility models, $\sigma_t$ is not
determined by $S_t$ and $t$.

Intuition from the case of time-dependent volatility does not apply directly
to stochastic volatility. For example, one can define the random variable

$$\bar{\sigma} := \left(\frac{1}{T}\int_0^T \sigma_t^2\,dt\right)^{1/2}$$

but note that in general

$$I \ne E\bar{\sigma}.$$

For example, in the case where the $\sigma$ process is independent of $W$, the mixing
argument of Hull and White [Hull & White, 1987] shows that

$$C_0 = E\,e^{-rT}(S_T - K)^+ = E\big(E[e^{-rT}(S_T - K)^+ \mid \{\sigma_t\}_{0 \le t \le T}]\big) = E\,C^{BS}(\bar{\sigma}).$$

However, this is not equal to $C^{BS}(E\bar{\sigma})$ because $C^{BS}$ is not a linear function of
its volatility argument. What we can say is that for the at-the-money-forward
strike, $C^{BS}$ is nearly linear in $\sigma$, because its second $\sigma$ derivative is negative but
typically small; so by Jensen $I \le E\bar{\sigma}$, but equality nearly holds.

Note that this $I \approx E\bar{\sigma}$ heuristic is specific to one particular strike, that it
assumes independence of $\sigma_t$ and $W_t$, and that the expectation is under a
risk-neutral pricing measure, not the statistical measure. We caution against the
improper application of this rule outside of its limited context.
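The size of the Jensen gap at the at-the-money-forward strike can be seen in a toy version of the mixing argument. The sketch below assumes, purely for illustration, that $\bar{\sigma}$ takes the values 0.10 and 0.30 with equal probability and is independent of $W$, so that the option price is the average of two Black-Scholes prices; the function names and parameter values are hypothetical.

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bs_call(S, K, T, r, sigma):
    """Black-Scholes call price at t = 0 for expiry T."""
    vol = sigma * math.sqrt(T)
    d1 = math.log(S * math.exp(r * T) / K) / vol + 0.5 * vol
    return S * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d1 - vol)

def implied_vol(price, S, K, T, r, lo=1e-8, hi=10.0, tol=1e-12):
    """Bisection on sigma over an assumed bracket [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, T, r, mid) < price:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

S0, T, r = 100.0, 1.0, 0.02
K_atmf = S0 * math.exp(r * T)               # at-the-money-forward strike
scenarios = [(0.5, 0.10), (0.5, 0.30)]      # (probability, value of sigma-bar)

# Mixing: C_0 = E[C_BS(sigma-bar)] when sigma is independent of W.
mixed_price = sum(p * bs_call(S0, K_atmf, T, r, s) for p, s in scenarios)
I_atmf = implied_vol(mixed_price, S0, K_atmf, T, r)
mean_sigma_bar = sum(p * s for p, s in scenarios)
```

In this two-point example the implied volatility falls below $E\bar{\sigma} = 0.20$ by well under one tenth of a vol point, in line with the "nearly linear" remark; repeating the calculation at strikes away from the money forward shows the approximation degrading.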

So is there some time-averaged volatility interpretation of $I$ that does hold
in contexts where $I \approx E\bar{\sigma}$ fails?



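11.2.3.1 Relation to local-volatility results. Under stochastic volatility
dynamics, implied local variance at $(K, T)$ is the risk-neutral conditional
expectation of $\sigma_T^2$, given $S_T = K$. The argument of Derman and Kani
[Derman & Kani, 1998] is as follows. Let $f(S) = (S - K)^+$. Now take, formally,
an Ito differential with respect to $T$:

$$\begin{aligned}
d_T C &= d_T\big[e^{-rT}E(S_T - K)^+\big] = E\,d_T\big[e^{-rT}(S_T - K)^+\big] \\
&= e^{-rT}E\Big[f'(S_T)\,dS_T + \tfrac{1}{2}\sigma_T^2 S_T^2\,\delta(S_T - K)\,dT - r(S_T - K)^+\,dT\Big] \\
&= e^{-rT}E\Big[rS_T H(S_T - K) + \tfrac{1}{2}\sigma_T^2 S_T^2\,\delta(S_T - K) - r(S_T - K)^+\Big]\,dT \\
&= e^{-rT}E\Big[rK H(S_T - K) + \tfrac{1}{2}\sigma_T^2 S_T^2\,\delta(S_T - K)\Big]\,dT,
\end{aligned}$$

where $H$ denotes the Heaviside function. Assuming that $(S_T, \sigma_T^2)$ has a joint
density $p_{S_T, V_T}$, let $p_{S_T}$ denote the marginal density of $S_T$. Continuing, we
have

$$\begin{aligned}
\frac{\partial C}{\partial T} &= -rK\frac{\partial C}{\partial K} + \frac{1}{2}e^{-rT}\iint v\,s^2\,\delta(s - K)\,p_{S_T, V_T}(s, v)\,ds\,dv \\
&= -rK\frac{\partial C}{\partial K} + \frac{1}{2}e^{-rT}K^2\int v\,p_{S_T, V_T}(K, v)\,dv.
\end{aligned}$$

So, by definition of implied local variance, and since $\partial^2 C/\partial K^2 = e^{-rT}p_{S_T}(K)$,

$$L^2(K, T) = \frac{\int v\,p_{S_T, V_T}(K, v)\,dv}{p_{S_T}(K)} = E(\sigma_T^2 \mid S_T = K).$$

Consequently, any characterization of $I$ as an average expected local
volatility becomes tantamount to a characterization of $I$ as an average conditional
expectation of stochastic volatility.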

APPLICATION 11.2.1 The BBF results in sections 11.2.2.2 and 11.2.2.3 can
be interpreted, under stochastic volatility, as expressions of implied volatility
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as [harmonic or quadratic] average conditional expectations of future volatil- 
ity. 

11.2.3.2 The path-from-spot-to-strike approach. The following reasoning
by Gatheral [Gatheral, 2001] provides an interpretation of implied volatility
as average expected stochastic volatility, without assuming short times to
maturity or strikes deep in/out of the money.
Fix $K$ and $T$. Let

$$\Gamma^{BS} := \frac{\partial^2 C^{BS}}{\partial S^2}$$

be the Black-Scholes gamma function.

Assume there exists a nonrandom nonnegative function $v(t)$ such that for
all $t$ in $(0, T)$,

$$v(t) = \frac{E\big[\sigma_t^2 S_t^2\,\Gamma^{BS}(S_t, t, \bar{\sigma}(t))\big]}{E\big[S_t^2\,\Gamma^{BS}(S_t, t, \bar{\sigma}(t))\big]} \qquad (2.7)$$

where

$$\bar{\sigma}(t) := \left(\frac{1}{T - t}\int_t^T v(u)\,du\right)^{1/2}.$$

Note that $\sigma_t$ need not be a deterministic function of spot and time.
Define the function

$$c(S, t) := C^{BS}(S, t, \bar{\sigma}(t)),$$

which solves the following PDE for $(S, t) \in (0, \infty) \times (0, T)$:

$$\frac{\partial c}{\partial t} = -\frac{1}{2}v(t)S^2\frac{\partial^2 c}{\partial S^2} - rS\frac{\partial c}{\partial S} + rc. \qquad (2.8)$$

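We have

$$\begin{aligned}
C(K, T) &= E\big[e^{-rT}(S_T - K)^+\big] = E\big[e^{-rT}c(S_T, T)\big] \\
&= c(S_0, 0) + E\int_0^T e^{-rt}\Big[\frac{\partial c}{\partial t}(S_t, t)\,dt + \frac{1}{2}\sigma_t^2 S_t^2\frac{\partial^2 c}{\partial S^2}(S_t, t)\,dt + \frac{\partial c}{\partial S}(S_t, t)\,dS_t - rc(S_t, t)\,dt\Big] \\
&= c(S_0, 0) + E\int_0^T e^{-rt}\,\frac{1}{2}\big(\sigma_t^2 - v(t)\big)S_t^2\frac{\partial^2 c}{\partial S^2}(S_t, t)\,dt \\
&= c(S_0, 0) = C^{BS}(S_0, 0, \bar{\sigma}(0)),
\end{aligned}$$

using Ito's rule, then (2.8), then (2.7). Therefore

$$I^2 = \bar{\sigma}^2(0) = \frac{1}{T}\int_0^T v(t)\,dt = \frac{1}{T}\int_0^T E^{G_t}\sigma_t^2\,dt, \qquad (2.9)$$

where the final step re-interprets the definition (2.7) of $v(t)$ as the expectation
of $\sigma_t^2$ with respect to the probability measure $G_t$ defined, relative to the pricing
measure $P$, by the Radon-Nikodym derivative

$$\frac{dG_t}{dP} := \frac{S_t^2\,\Gamma^{BS}(S_t, t, \bar{\sigma}(t))}{E\big[S_t^2\,\Gamma^{BS}(S_t, t, \bar{\sigma}(t))\big]}.$$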

So (2.9) interprets implied volatility as an average expected variance. Moreover,
this expectation with respect to $G_t$ can be visualized as follows. Write

$$E^{G_t}\sigma_t^2 = \int_0^\infty E(\sigma_t^2 \mid S_t = s)\,\kappa_t(s)\,ds, \tag{2.10}$$

where the nonrandom function $\kappa_t$ is defined by

$$\kappa_t(s) := \frac{s^2\,\Gamma^{BS}(s,t,\bar\sigma(t))\,p_{S_t}(s)}{\int_0^\infty s^2\,\Gamma^{BS}(s,t,\bar\sigma(t))\,p_{S_t}(s)\,ds},$$

and $p_{S_t}$ denotes the density of $S_t$.

Thus $E(\sigma_t^2 \mid S_t = s)$ is integrated against a kernel $\kappa_t(s)$ which has the
following behavior. For $t \downarrow 0$, the kernel approaches the Dirac function $\delta(s - S_0)$,
because the $p_{S_t}$ factor has that behavior, while the $s^2\Gamma^{BS}$ factor approaches an
ordinary function. For $t \uparrow T$, the kernel approaches the Dirac function $\delta(s - K)$,
because the $s^2\Gamma^{BS}$ factor has that behavior, while the $p_{S_t}$ factor approaches an
ordinary function. At each time $t$ intermediate between 0 and $T$, the kernel has
a finite peak, which moves from $S_0$ to $K$, as $t$ moves from 0 to $T$.

This leads to two observations. First, one has the conjectural approximation

$$E^{G_t}\sigma_t^2 \approx E(\sigma_t^2 \mid S_t = s^*(t)),$$

where the non-random point $s^*(t)$ is the $s$ that maximizes the kernel $\kappa_t$. By
(2.10), therefore,

$$I^2 \approx \frac{1}{T}\int_0^T E(\sigma_t^2 \mid S_t = s^*(t))\,dt.$$

Second, the kernel's concentration of "mass" initially (for $t = 0$) at $S_0$, and
terminally (for $t = T$) at $K$ resembles the marginal densities of the $S$ diffusion,
pinned by conditioning on $S_T = K$. This leads to Gatheral's observation that
implied variance is, to a first approximation, the time integral of the expected
instantaneous variance along the most likely path from $S_0$ to $K$. We leave
open the questions of how to make these observations more precise, and how
to justify the original assumption.



APPLICATION 11.2.2 Given an approximation for local volatility, such as in
[Gatheral, 2001], one can usually compute explicitly an approximation for a 
spot-to-strike average, thus yielding an approximation to implied volatility. 

For example, given an approximation for local volatility linear in x, the 
spot-to-strike averaging argument can be used to justify a rule of thumb (as 
in [Derman, Kani & Zou, 1996]) that approximates implied volatility also lin- 
early in x, but with one-half the slope of local volatility. 
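
The half-slope rule can be illustrated numerically. The sketch below rests on a simplifying assumption of ours, not a statement from the sources cited: the spot-to-strike average is taken to be a plain arithmetic average of local volatility along the straight segment in log-moneyness from 0 to x. The function names and the parameter values are hypothetical.

```python
def local_vol(x, sigma0=0.20, slope=-0.10):
    """Hypothetical local volatility, linear in log-moneyness x."""
    return sigma0 + slope * x


def spot_to_strike_average(vol, x, n=1000):
    """Midpoint-rule average of vol(u) for u on the segment from 0 to x."""
    if x == 0.0:
        return vol(0.0)
    h = x / n
    return sum(vol((i + 0.5) * h) for i in range(n)) / n


def implied_slope(vol, x):
    """Secant slope in x of the averaged volatility, measured from x = 0."""
    return (spot_to_strike_average(vol, x) - vol(0.0)) / x


if __name__ == "__main__":
    # Under the stated assumption, the averaged skew has half the local slope.
    print(implied_slope(local_vol, 0.3))
```

Under this averaging assumption the averaged volatility is again linear in x, and its slope is one-half of the local volatility slope, which is the content of the rule of thumb.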

11.3 Statics 

We examine here the implications of various assumptions on the shape of 
the implied volatility surface, beginning in section 11.3.1. with only minimal 
assumptions of no-arbitrage, and then specializing in 11.3.2. and 11.3.3. to the 
cases of local volatility and stochastic volatility diffusions. The term "statics" 
refers to the analysis of I(x, T) or I(K, T) for t fixed. 

As reference points, let us review some of the empirical facts about the 
shape of the volatility surface; see, for example, [Rebonato, 1999] for further 
discussion. A plot of $I$ is not constant with respect to $K$ (or $x$). It can take
the shape of a smile, in which $I(K)$ is greater for $K$ away-from-the-money
than it is for $K$ near-the-money. The more typical pattern in post-1987 equity
markets, however, is a skew (or skewed smile) in which at-the-money $I$ slopes
downward, and the smile is far more pronounced for small $K$ than for large
$K$. Empirically the smile or skew flattens as $T$ increases. In particular, a
popular rule-of-thumb (which we will revisit) states that skew slopes decay
with maturity approximately as $1/\sqrt{T}$; indeed, when comparing skew slopes
across different maturities, practitioners often define "moneyness" as $x/\sqrt{T}$
instead of $x$.

The theory of how $I$ behaves under various model specifications has at least
three applications. First, to the extent that a model generates a theoretical
$I$ shape that differs qualitatively from empirical facts, we have evidence of
model misspecification. Second, given an observed volatility skew, analytical
expressions approximating $I(x,T)$ in terms of model parameters can be useful
in calibrating those parameters. Third, necessary conditions on $I$ for the
absence of arbitrage provide consistency checks that can help to reject unsound
proposals for volatility skew parameterizations.

Part of the challenge for future research will be to extend this list of models 
and regimes for which we understand the behavior of implied volatility. 
11.3.1. Statics under absence of arbitrage

Assuming only the absence of arbitrage, one obtains bounds on the slope of 
the implied volatility surface, as well as a characterization of how fast I grows 
at extreme strikes. 

11.3.1..1 Slope bounds. Hodges [Hodges, 1996] gives bounds on implied
volatility based on the nonnegativity of call spreads and put spreads.
Specifically, if $K_1 < K_2$ then

$$C(K_1) \ge C(K_2), \qquad P(K_1) \le P(K_2). \tag{3.1}$$

Gatheral [Gatheral, 1999] improves this observation to

$$C(K_1) \ge C(K_2), \qquad \frac{P(K_1)}{K_1} \le \frac{P(K_2)}{K_2}, \tag{3.2}$$

which is evident from a comparison of the respective payoff functions. Assuming
the differentiability of option prices in $K$,

$$\frac{\partial C}{\partial K} \le 0, \qquad \frac{\partial}{\partial K}\Big(\frac{P}{K}\Big) \ge 0.$$

Substituting $C = C^{BS}(I)$ and $P = P^{BS}(I)$ and simplifying, we have

$$-\frac{N(-d_1)}{\sqrt{T}\,N'(d_1)} \le \frac{\partial I}{\partial x} \le \frac{N(d_2)}{\sqrt{T}\,N'(d_2)},$$

where the upper and lower bounds come from the call and put constraints,
respectively.

Using (as in [Carr & Wu, 2002]) the Mill's Ratio $R(d) := (1-N(d))/N'(d)$
to simplify notation, we rewrite the inequality as

$$-\frac{R(d_1)}{\sqrt{T}} \le \frac{\partial I}{\partial x} \le \frac{R(-d_2)}{\sqrt{T}}.$$

Note that proceeding from (3.1) without Gatheral's refinement (3.2) yields the
significantly weaker lower bound $-R(d_2)/\sqrt{T}$.
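
As a rough illustration, the slope bounds can be evaluated for a given point on the skew. The sketch assumes that x denotes log-moneyness relative to the forward, so that d1 and d2 equal -x/(I sqrt(T)) plus or minus I sqrt(T)/2; this is consistent with the at-the-money values used in the text, but the convention and the function names are our assumptions.

```python
import math


def mills_ratio(d):
    """R(d) = (1 - N(d)) / N'(d); erfc keeps the upper tail accurate."""
    upper_tail = 0.5 * math.erfc(d / math.sqrt(2.0))
    density = math.exp(-0.5 * d * d) / math.sqrt(2.0 * math.pi)
    return upper_tail / density


def slope_bounds(x, T, implied_vol):
    """Return (lower, upper) bounds on dI/dx at log-moneyness x and expiry T."""
    s = implied_vol * math.sqrt(T)
    d1 = -x / s + 0.5 * s  # assumed convention: x = log(strike / forward)
    d2 = d1 - s
    return -mills_ratio(d1) / math.sqrt(T), mills_ratio(-d2) / math.sqrt(T)


def weak_lower_bound(x, T, implied_vol):
    """Lower bound -R(d2)/sqrt(T) obtained without the put-over-strike refinement."""
    s = implied_vol * math.sqrt(T)
    d2 = -x / s - 0.5 * s
    return -mills_ratio(d2) / math.sqrt(T)


if __name__ == "__main__":
    print(slope_bounds(0.0, 1.0, 0.20), weak_lower_bound(0.0, 1.0, 0.20))
```

A candidate skew parameterization can be screened by comparing its finite-difference slope in x against these two numbers at each strike.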

Of particular interest is the behavior at-the-money, where $x = 0$. In the
short-dated limit, as $T \to 0$, assume that $I(0,T)$ is bounded above. Then

$$d_{1,2}(x = 0) = \pm I(0,T)\sqrt{T}/2 \to 0.$$

Since $R(0)$ is a positive constant, the at-the-money skew slope must have the
short-dated behavior

$$\Big|\frac{\partial I}{\partial x}(0,T)\Big| = O\Big(\frac{1}{\sqrt{T}}\Big), \qquad T \to 0. \tag{3.3}$$

In the long-dated limit, as $T \to \infty$, assume that $I(0,T)$ is bounded away from
0. Then

$$d_{1,2}(x = 0) = \pm I(0,T)\sqrt{T}/2 \to \pm\infty.$$

Since $R(d) \sim d^{-1}$ as $d \to \infty$, the at-the-money skew slope must have the
long-dated behavior

$$\Big|\frac{\partial I}{\partial x}(0,T)\Big| = O\Big(\frac{1}{T}\Big), \qquad T \to \infty. \tag{3.4}$$



REMARK 11.3.1 According to (3.4), the rule of thumb that approximates the
skew slope decay rate as $T^{-1/2}$ cannot maintain validity into long-dated
expiries.

11.3.1..2 The moment formula. Lee [Lee, 2002] proves the moment
formula for implied volatility at extreme strikes. Previous work, in Avellaneda
and Zhu [Avellaneda & Zhu, 1998], had produced asymptotic calculations for
one specific stochastic volatility model, but the moment formula is entirely
general, and it uncovers the key role of finite moments.

At any given expiry $T$, the tails of the implied volatility skew can grow no
faster than $|x|^{1/2}$. Specifically, in the right-hand tail, for $|x|$ sufficiently large,
the Black-Scholes implied variance satisfies

$$I^2(x,T) \le 2|x|/T \tag{3.5}$$

and a similar relationship holds in the left-hand tail.

For proof, write $I^* := (2|x|/T)^{1/2}$, and show that $C^{BS}(I) < C^{BS}(I^*)$ for
large $|x|$. This holds because the left-hand side approaches 0 but the right-hand
side approaches a positive limit as $x \to \infty$.

APPLICATION 11.3.2 This bound has implications for choosing functional
forms of splines to extrapolate volatility skews. Specifically, it advises against
fitting the skew's tails with any function that grows more quickly than $|x|^{1/2}$.

Moreover, the tails cannot grow more slowly than $|x|^{1/2}$, unless $S_T$ has finite
moments of all orders. This further restricts the advisable choices for
parameterizing a volatility skew. To prove this fact, note that it is a consequence of the
moment formula, which we now describe.

The smallest (infimal) coefficient that can replace the 2 in (3.5) depends, of
course, on the distribution of $S_T$, but the form of the dependence is notably
simple. This sharpest possible coefficient is entirely determined by $\tilde p$ in the
right-hand tail, and $\tilde q$ in the left-hand tail, where the real numbers

$$\tilde p := \sup\{p : E S_T^{1+p} < \infty\}$$
$$\tilde q := \sup\{q : E S_T^{-q} < \infty\},$$

can be considered, by abuse of language, the "number" of finite moments in
the underlying distribution. The moment formula makes explicit these
relationships.

Specifically, let us write $I^2$ as a variable coefficient times $|x|/T$, the ratio of
absolute-log-moneyness to maturity. Consider the limsups of this coefficient
as $x \to \pm\infty$:

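$$\beta_R(T) := \limsup_{x \to \infty} \frac{I^2(x,T)}{|x|/T}$$
$$\beta_L(T) := \limsup_{x \to -\infty} \frac{I^2(x,T)}{|x|/T}.$$

One can think of $\beta_R$ and $\beta_L$ as the right-hand and absolute left-hand slopes of
the linear "asymptotes" to implied variance.

The main theorem in [Lee, 2002] establishes that $\beta_R$ and $\beta_L$ both belong to
the interval $[0, 2]$, and that their values depend only on the moment counts $\tilde p$
and $\tilde q$, according to the moment formula:

$$\tilde p = \frac{1}{2\beta_R} + \frac{\beta_R}{8} - \frac{1}{2}$$
$$\tilde q = \frac{1}{2\beta_L} + \frac{\beta_L}{8} - \frac{1}{2}.$$

One can invert the moment formula, by solving for $\beta_R$ and $\beta_L$:

$$\beta_R = 2 - 4\big(\sqrt{\tilde p^2 + \tilde p} - \tilde p\big),$$
$$\beta_L = 2 - 4\big(\sqrt{\tilde q^2 + \tilde q} - \tilde q\big).$$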
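
The moment formula and its inverse are simple enough to check by a round trip. The following is a minimal sketch; the function names are ours, and the same pair of functions serves both tails (with the right-hand or the left-hand moment count as input).

```python
import math


def moments_from_slope(beta):
    """Moment formula: moment count implied by a tail slope beta in (0, 2]."""
    return 1.0 / (2.0 * beta) + beta / 8.0 - 0.5


def slope_from_moments(count):
    """Inverse moment formula: tail slope implied by a moment count >= 0."""
    return 2.0 - 4.0 * (math.sqrt(count * count + count) - count)


if __name__ == "__main__":
    for count in (0.0, 0.5, 1.0, 5.0, 50.0):
        beta = slope_from_moments(count)
        print(f"count={count:5.1f}  slope={beta:.6f}  round trip={moments_from_slope(beta):.6f}")
```

A moment count of zero gives the maximal slope 2, and the slope decreases towards 0 as the number of finite moments grows.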

The idea of the proof is as follows. By the Black-Scholes formula, the tail 
behavior of the implied volatility skew carries the same information as the tail 
behavior of option prices. In turn, the tail growth of option prices carries the 
same information as the number of finite moments - intuitively, option prices 
are bounded by moments, because a call or put payoff can be dominated by 
a power payoff; on the other hand, moments are bounded by option prices, 
because a power payoff can be dominated by a mixture, across a continuum of 
strikes, of call or put payoffs. 

In a wide class of specifications for the dynamics of $S$, the moment counts
$\tilde p$ and $\tilde q$ are readily computable functions of the model's parameters. This
occurs whenever $\log S_T$ has a distribution whose characteristic function $f$ is
explicitly known. In such cases, one calculates $E S_T^{p+1}$ simply by extending $f$
analytically to a strip in $\mathbb{C}$ containing $-i(p + 1)$, and evaluating $f$ there; if no
such extension exists, then $E S_T^{p+1} = \infty$. In particular, among affine
jump-diffusions and Levy processes, one finds many instances of such models. See,
for example, [Duffie, Pan & Singleton, 2000; Lee, 2001].
APPLICATION 11.3.3 The moment formula may speed up the calibration
of model parameters to observed skews. By observing the tail slopes of the
volatility skew, and applying the moment formula, one obtains $\tilde p$ and $\tilde q$.
Combined with analysis of the characteristic function, this produces two constraints
on the model parameters, and in models such as the examples below, actually
determines two of the model's parameters. We do not claim that the moment
formula alone can replace a full optimization procedure, but it could facilitate
the process by providing a highly accurate initial guess of the optimal
parameters.

EXAMPLE 11.3.4 In the double-exponential jump-diffusion model of [Kou,
2002; Kou & Wang, 2001], the asset price follows a geometric Brownian
motion between jumps, which occur at event times of a Poisson process. Up-jumps
and down-jumps are exponentially distributed with the parameters $\eta_1$ and $\eta_2$
respectively, and hence the means $1/\eta_1$ and $1/\eta_2$ respectively. Using the
characteristic function, one computes

$$\tilde q = \eta_2, \qquad \tilde p = \eta_1 - 1. \tag{3.6}$$

Thus $\eta_1$ and $\eta_2$ can be inferred from $\tilde p$ and $\tilde q$, which in turn come from the
slopes of the volatility skew, via the moment formula.

The intuition of (3.6) is as follows: the larger the expected size of an
up-jump, the fatter the $S_T$ distribution's right-hand tail, and the fewer the number
of positive moments. Similar intuition holds for down-jumps. Note that the
jump frequency has no effect on the asymptotic slopes.
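
The calibration shortcut for this example can be sketched as follows, reading (3.6) as: the left-hand moment count equals the down-jump parameter, and the right-hand moment count equals the up-jump parameter minus one. The function names and the numerical values are hypothetical, and in practice the tail slopes would have to be estimated from observed skews.

```python
import math


def moments_from_slope(beta):
    """Moment formula: moment count implied by a tail slope beta in (0, 2]."""
    return 1.0 / (2.0 * beta) + beta / 8.0 - 0.5


def slope_from_moments(count):
    """Inverse moment formula: tail slope implied by a moment count >= 0."""
    return 2.0 - 4.0 * (math.sqrt(count * count + count) - count)


def kou_tail_slopes(eta_up, eta_down):
    """Asymptotic (right, left) slopes implied by the two jump parameters."""
    return slope_from_moments(eta_up - 1.0), slope_from_moments(eta_down)


def kou_jump_parameters(beta_right, beta_left):
    """Recover (eta_up, eta_down) from observed tail slopes of implied variance."""
    return moments_from_slope(beta_right) + 1.0, moments_from_slope(beta_left)


if __name__ == "__main__":
    slopes = kou_tail_slopes(10.0, 5.0)  # hypothetical parameter values
    print(slopes, kou_jump_parameters(*slopes))
```

The recovered pair could then seed a full optimization, in the spirit of Application 11.3.3.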

EXAMPLE 11.3.5 In the normal inverse gaussian model of Barndorff-Nielsen
[Barndorff-Nielsen, 1998], returns have a distribution defined as follows:
consider two dimensional Brownian motion with constant drift $(\beta, \gamma)$, and let $\alpha$ be
the Euclidean magnitude of this drift. The NIG distribution is the distribution
of the first coordinate of the Brownian motion at the stopping time when the
second coordinate hits a specified constant barrier. Then one can calculate

$$\tilde q = \alpha + \beta, \qquad \tilde p = \alpha - \beta - 1, \tag{3.7}$$

which also has intuitive content: larger $\alpha$ implies earlier stopping, hence
thinner tails and more moments (of both positive and negative order); larger $\beta$
fattens the right-hand tail and thins the left-hand tail, decreasing the number
of positive moments and increasing the number of negative moments.

11.3.2. Statics under local volatility 

Assume that the underlying follows a local volatility diffusion of the form
(2.1). Writing $F := Se^{r(T-t)}$ for the forward price, suppose that local
volatility can be expressed as a function $h$ of $F$ alone:

$$\sigma(S,t) = h(Se^{r(T-t)}).$$

Hagan and Woodward (in [Hagan & Woodward, 1999], and with Kumar and
Lesniewski in [Hagan et al, 2002]), develop regular perturbation solutions to
(2.2) in powers of $\varepsilon := h(K)$, assumed to be small. The resulting call price
formula then yields the implied volatility approximation

$$I(K,T) \approx h(\bar F) + \frac{1}{24}h''(\bar F)(F_0 - K)^2, \tag{3.8}$$

where $\bar F := (F_0+K)/2$ is the midpoint between forward and strike. The same
sources also discuss alternative assumptions and more refined approximations.
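
A small sketch of how (3.8) might be evaluated for a user-supplied local volatility function follows. It reads the coefficient of the curvature term as 1/24, which is our reading of the approximation, and it replaces the second derivative by a central finite difference; the function names, the step size, and the example function are all hypothetical.

```python
def implied_vol_approx(h, forward, strike, step=1e-2):
    """Midpoint value of h plus a curvature correction, as in (3.8)."""
    mid = 0.5 * (forward + strike)
    # Central finite-difference estimate of h''(mid); an assumption of this sketch.
    curvature = (h(mid + step) - 2.0 * h(mid) + h(mid - step)) / step**2
    return h(mid) + curvature * (forward - strike) ** 2 / 24.0


def example_local_vol(f):
    """Hypothetical local volatility, quadratic in the forward price."""
    return 0.20 + 1e-5 * (f - 100.0) ** 2


if __name__ == "__main__":
    for strike in (80.0, 100.0, 120.0):
        print(strike, implied_vol_approx(example_local_vol, 100.0, strike))
```

For a function h with no curvature the correction vanishes, and the approximation reduces to local volatility evaluated at the midpoint between forward and strike.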

REMARK 11.3.6 The reasoning of section 11.2.3..2 suggests an
interpretation of the leading term $h(\bar F)$ in (3.8) as a midpoint approximation to the
average local volatility along a path from $(F_0, 0)$ to $(K, T)$.

11.3.3. Statics under stochastic volatility

Now assume that the underlying follows a stochastic volatility diffusion of
the form

$$dS_t = rS_t\,dt + \sqrt{V_t}\,S_t\,dW_t$$
$$dV_t = a(V_t)\,dt + b(V_t)\,dZ_t$$

where Brownian motions $W$ and $Z$ have correlation $\rho$. From here one obtains,
typically via perturbation methods, approximations to the implied volatility
skew $I$. Our coverage will emphasize those approximations which apply to
entire classes of stochastic volatility models, not specific to one particular choice
of $a$ and $b$. We label each approximation according to the regime in which it
prevails.

11.3.3..1 Zero correlation. Renault and Touzi [Renault & Touzi, 1996]
prove that in the case $\rho = 0$, implied volatility is a symmetric smile - symmetric
in the sense that

$$I(x,T) = I(-x,T)$$

and a smile in the sense that $I$ is increasing in $x$ for $x > 0$.

Moreover, as shown in [Ball & Roma, 1994], the parabolic shape of $I$
is apparent from Taylor approximations. Expanding the function $\tilde C^{BS}(v) :=
C^{BS}(\sqrt{v})$ about $v = E\bar V$, we have

$$C = \tilde C^{BS}(I^2) \approx \tilde C^{BS}(E\bar V) + (I^2 - E\bar V)\frac{\partial \tilde C^{BS}}{\partial v}.$$

Comparing this to a Taylor expansion of the mixing formula

$$C = E\tilde C^{BS}(\bar V) \approx \tilde C^{BS}(E\bar V) + \frac{1}{2}\mathrm{Var}(\bar V)\frac{\partial^2 \tilde C^{BS}}{\partial v^2}$$

yields the approximation

$$I^2 \approx E\bar V + \frac{1}{4}\frac{\mathrm{Var}(\bar V)}{(E\bar V)^2}\Big(\frac{x^2}{T} - E\bar V - \frac{1}{4}(E\bar V)^2T\Big),$$

which is quadratic in $x$, with minimum at $x = 0$.

REMARK 11.3.7 To the extent that implied volatility skews are empirically
not symmetric in equity markets, stochastic volatility models with zero
correlation will not be consistent with market data.
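
The quadratic zero-correlation approximation is easy to tabulate. The sketch below takes the mean and the variance of the averaged variance as given inputs; how they are obtained depends on the model and is outside this sketch. The function name and the numerical values are hypothetical.

```python
def zero_corr_implied_variance(x, T, mean_var, var_of_var):
    """Quadratic-in-x approximation to implied variance when correlation is zero.

    mean_var is the expected averaged variance and var_of_var is the variance
    of the averaged variance; both are assumed to be supplied by the user.
    """
    ratio = var_of_var / mean_var**2
    bracket = x * x / T - mean_var - 0.25 * mean_var**2 * T
    return mean_var + 0.25 * ratio * bracket


if __name__ == "__main__":
    for x in (-0.4, -0.2, 0.0, 0.2, 0.4):
        print(x, zero_corr_implied_variance(x, 0.5, 0.04, 0.0004))
```

The tabulated values are symmetric in x and smallest at x = 0, matching the symmetric-smile result stated above.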



11.3.3..2 Small volatility of volatility, and the short-dated limit. Lewis
[Lewis, 2000] shows that the forward call price, viewed as a function of $x$, has
a complex Fourier transform given by $H(k,V,T)/(k^2 - ik)$, where $k$ is the
transform variable and $H$ solves the PDE

$$\frac{\partial H}{\partial T} = \frac{1}{2}b^2\frac{\partial^2 H}{\partial V^2} + \big(a - ik\rho bV^{1/2}\big)\frac{\partial H}{\partial V} - \frac{k^2 - ik}{2}VH$$

with initial condition $H(k,V,0) = 1$. In our setting, $H$ can be viewed as the
characteristic function of the negative of the log-return on the forward price of
$S$.

Assuming that $b(V) = \eta B(V)$ for some constant parameter $\eta$, one finds a
perturbation solution for $H$ in powers of $\eta$. The transform can be inverted to
produce a call price, by a formula such as



$$C = S - \frac{Ke^{-rT}}{2\pi}\int_{i/2-\infty}^{i/2+\infty} e^{ikx}\,\frac{H(k,V,T)}{k^2 - ik}\,dk, \tag{3.9}$$

yielding a series for $C$ in powers of $\eta$. From the $C$ series and the Black-Scholes
formula, Lewis derives the implied variance expansion



I 2 = EV + 77 



JW/_ 



X 1 

r IteF + 2 



+ ^ 



J(2) JO) / x 2 J J 

"^ r+ T V2(EF) 2 T 2 27W 8 

j( 4 ) / x 2 _x 4 -TEV 

+ T Vt 2 (EV0 2 + TEV 4TEV 



+ 



(JW)' 



5x 2 



12 + TEV \ 



2T \ 2T 3 (EF) 3 TEV 8T 2 (EV) 2 ) _ 



+ o( v 3 ), 



where the $J^{(i)}$ are integrals of known functions.

EXAMPLE 11.3.8 The short-time-to-expiry limit is



I 2 (x,0) = V + 



1 pb 



2VVo 



x + 



A_H A &2 l P bd JP b ) 
12 48 P JVg 6V dV 



x 2 + 0{rf). 

(3.10) 

The leading terms agree to $O(\eta)$ with the slow-mean-reversion result of section
11.3.3..5. We defer further commentary until there.

EXAMPLE 11.3.9 In the case where

$$dV_t = \kappa(\theta - V_t)\,dt + \eta V_t^{\varphi}\,dW_t,$$

we have

$$E\bar V = \theta + \frac{1 - e^{-\kappa T}}{\kappa T}(V_0 - \theta)$$

and $J^{(2)} = 0$, while

jd) = I / T ( i _ e -*<?-)) \ e + e-™(v - e) 1 * +1/2 

« Jo L 



(3.11) 



ds 



J(3) 



2« 2 / ( 



1-e - 

2 /-T 



k(T-s) 



-«1\2 



+ e" 



W - *)] 



2<^5 



ds 



J(4) = ( <P + l)^J [e + e- K{T - S) (Vo-0)Y +1/2 J«HT,s)ds 

= /V" 
Jo 



J(6) 



(»—") _ e - KS } 



+ e -*(T-*)(y o -9) 



In particular, taking $\varphi = 1/2$ produces the Heston [Heston, 1993] square-root
model. In the special case where $V_0 = \theta$, the slope of the implied variance
skew is, to leading order in $\eta$,

$$\frac{\partial I^2}{\partial x} = \frac{\rho\eta}{\kappa T}\Big(1 - \frac{1 - e^{-\kappa T}}{\kappa T}\Big),$$

which agrees with a computation, by Gatheral [Gatheral, 2001], that uses the
expectations interpretation of local volatility.

11.3.3..3 The long-dated limit. Given a stochastic volatility model
with a known transform $H$, Lewis solves for $\lambda(k)$ and $u(k,V)$ such that $H$
separates multiplicatively, for large $T$, into $T$-dependent and $V$-dependent
factors:

$$H(k,V,T) \approx e^{-\lambda(k)T}u(k,V), \qquad T \to \infty.$$

Suppose that $\lambda(k)$ has a saddle point at $k_0 \in \mathbb{C}$ where $\lambda'(k_0) = 0$. Applying
classical saddle-point methods to (3.9) yields

$$C(S,V,T) \approx S - Ke^{-rT}\,\frac{u(k_0,V)}{k_0^2 - ik_0}\,\frac{\exp[-\lambda(k_0)T + ik_0x]}{\sqrt{2\pi\lambda''(k_0)T}}.$$

By comparing this to the corresponding approximation of $C^{BS}(I)$, Lewis
obtains the implied variance approximation

2 

I 2 (x) « 8A(&o) + (8Im(fc ) - 4)|; - 2A( ^ )r2 + 0(^ 3 ), T -+ oo. 

The fact that $I(x,T)$ is linear to first order in $x/T$ agrees with the
fast-mean-reversion result of Fouque, Papanicolaou, and Sircar [Fouque, Papanicolaou &
Sircar, 2000]. We defer further commentary until section 11.3.3..4.

EXAMPLE 11.3.10 In the case (3.11) with $\varphi = 1/2$ (the square-root model),
Lewis finds



i 
1-p 2 


'1 

2 ~ 


KT) 


2(1 -P 


2)7/2 



X{ko) = 2(1 - p2)r?2 [ V(2* - PV) 2 + (1 - P 2 W ~ (2« - on) 

The sign of the leading-order at-the-money skew slope $(8\,\mathrm{Im}(k_0) - 4)/T$ agrees
with the sign of the correlation $\rho$.

11.3.3..4 Fast mean reversion. Fouque-Papanicolaou-Sircar ([Fouque,
Papanicolaou & Sircar, 2000]; FPS henceforth) model stochastic volatility as a
function $f$ of a state variable $Y_t$ that follows a rapidly mean-reverting diffusion
process. In the case of Ornstein-Uhlenbeck $Y$, this means that for some large
$\alpha$,

$$dS_t = \mu S_t\,dt + f(Y_t)S_t\,dW_t$$
$$dY_t = \alpha(\theta - Y_t)\,dt + \beta\,dZ_t$$

under the statistical measure, where the Brownian motions $W$ and $Z$ have
correlation $\rho$.

Rewriting this under a pricing measure,

$$dS_t = rS_t\,dt + f(Y_t)S_t\,dW_t$$
$$dY_t = [\alpha(\theta - Y_t) - \beta\Lambda(Y_t)]\,dt + \beta\,dZ_t,$$

where the volatility risk premium $\Lambda$ is assumed to depend only on $Y$. Let $p_Y$
denote the invariant density (under the statistical probability measure) of $Y$,
which is normal with mean $\theta$ and variance $\beta^2/(2\alpha)$. Let angle brackets denote
average with respect to that density. Write

$$\bar\sigma_\infty^2 := \langle f^2 \rangle,$$

so that $\bar\sigma_\infty$ is the quadratic average of volatility with respect to the invariant
distribution.

By a singular perturbation analysis of the PDE for call price, FPS show that
implied volatility has an expansion with leading terms

$$I(x,T) = A\frac{x}{T} + B + O(1/\alpha),$$



where 



V 3 

7-7°+™-* 

-D •— (Too H = 1 

Coo 



and 

REMARK 11.3.11 The fast-mean-reversion approximation is particularly
suited for pricing long-dated options; in that long time horizon, volatility has
time to undergo much activity, so relative to the time scale of the option's
lifetime, volatility can indeed be considered to mean-revert rapidly.

Note that $I(x,T)$ is, to first order, linear in $x/T$. This functional form
agrees with Lewis's long-dated skew approximation (section 11.3.3..3).

REMARK 11.3.12 Today's volatility plays no role in the leading-order coef- 
ficients A and B. Instead, the dominant effects depend only on ergodic means. 
Intuitively, the assumption of large mean-reversion rapidly erodes the influence 
of today's volatility, leaving the long-run averages to determine A and B. 



REMARK 11.3.13 The slope of the long-dated implied volatility skew satis- 
fies 



£(»•-» 



As a consistency check, note that the long-dated asymptotics are consistent
with the no-arbitrage constraint (3.4). Specifically, the $T \to \infty$ skew slope
decay of these stochastic volatility models achieves the $O(T^{-1})$ bound.

APPLICATION 11.3.14 FPS give approximations to prices of certain
path-dependent derivatives under fast-mean-reverting stochastic volatility. Typically,
such approximations involve the Black-Scholes price for that derivative,
corrected by some term that depends on $V_2$ and $V_3$.

To evaluate this correction term, note that the formulas (3.12) can be solved
for $V_2$ and $V_3$ in terms of $A$, $B$, and $\bar\sigma_\infty$. FPS calibrate $A$ and $B$ to the implied
volatility skew, and estimate $\bar\sigma_\infty$ from historical data, producing estimates of $V_2$
and $V_3$, which become the basis for an approximation of the derivative price.

For example, in the case of uncorrelated volatility where $\rho = 0$, FPS find
that the price of an American put is approximated by the Black-Scholes
American put price, evaluated at the volatility parameter

$$\sqrt{\bar\sigma_\infty^2 - 2V_2},$$

which can be considered an "effective volatility."

11.3.3..5 Slow mean reversion. Assuming that for a constant parame- 
ter e, 

da t = sa{V t )dt + Vi/3(V t )dW t , 

Sircar and Papanicolaou [Sircar & Papanicolaou, 1999] develop, and Lee [Lee, 
2001a] extends, a regular perturbation analysis of the PDE 

dC 1 2c , 2 d 2 C r „ a 2 C 1 o2 d 2 C dC BC 

m + 2 aS lw + ^ eps ^dSd-. + 2 e/3 ^ + £a ^ + rS ds = rC 

satisfied by the call price under stochastic volatility. This leads to an expansion
for $C$ in powers of $\varepsilon$, which in turn leads to the implied volatility expansion



I « ct + \fe 



£L x+ P?f T 
2<to 4 



+ e 



(m&}, + p\j + ((* + ?m,-z£ 






-.2 



24a 6 J r \\24a 6 J r 2 V2a 

where $\beta' := \partial\beta/\partial\sigma$. In particular, short-dated implied volatility satisfies

I{x,Q)^a Q + ^^-x. (3.13) 

2<7 

REMARK 11.3.15 The slow-mean-reversion approximation is particularly 
suited for pricing short-dated options; in that short time horizon, volatility 
has little time in which to vary, so relative to the time scale of the option's 
lifetime, volatility can indeed be considered to mean-revert slowly. 

Note that (3.13) agrees precisely with the leading terms of Lewis's short- 
dated skew approximation (3.10). 

REMARK 11.3.16 In contrast to the case of rapid mean-reversion, the level
to which volatility reverts here plays no role in the leading-order coefficients.
With a small rate of mean-reversion, today's volatility will have the dominant
effect.

REMARK 11.3.17 For $\rho \ne 0$, the at-the-money skew exhibits a slope whose
sign agrees with $\rho$. For $\rho = 0$ the skew has a parabolic shape.

REMARK 11.3.18 In agreement with a result of Ledoit, Santa-Clara, and Yan
[Ledoit, Santa-Clara & Yan, 2001], we have $I(x,T) \to \sigma_0$ as $(x,T) \to (0,0)$.

APPLICATION 11.3.19 In principle, given a parametric form for $b$, the fact
that the short-dated skew has slope $\rho b$ gives information that can simplify
parameter calibration. For example, if the modelling assumption is that
$b = \beta f(V)$ for some constant parameter $\beta$ and known function $f$, then
directly from the short-dated skew and its slope, one obtains the product of the
parameters $\rho$ and $\beta$.

APPLICATION 11.3.20 Lewis observes, moreover, that this tool facilitates 
the inference of the functional form of b. Specifically, observe time-series of 
the short-dated at-the-money data pair: (implied volatility, skew slope). As 
implied volatility ranges over its support, the functional form of b is, in princi- 
ple, revealed. 

REMARK 11.3.21 Note that the $T \to 0$ skew slope is $O(1)$, which is strictly
smaller than the $O(T^{-1/2})$ constraint. To the extent that the short-dated
volatility skew slope empirically seems to attain the $O(T^{-1/2})$ upper bound instead of
the $O(1)$ diffusion behavior, this observed skew will not be easily captured by
standard diffusion models. Two approaches to this problem, and subjects for
further research, are to remain in the stochastic-volatility diffusion framework
but introduce time-varying coefficients (as in [Fouque, Papanicolaou, Sircar &
Solna, 2002]); or alternatively to go outside the diffusion framework entirely
and introduce jump dynamics, such as in [Carr & Wu, 2002].

11.4 Dynamics 

While traditional diffusion models specify the dynamics of the spot price
and its instantaneous volatility, a newer class of models seeks to specify
directly the dynamics of one or more implied volatilities. One reason to take $I$ as
primitive is that it enjoys wide acceptance as a descriptor of the state of an
options market. A second reason is that the observability of $I$ makes calibration
trivial.

In this section, today's date $t$ is not fixed at 0, because we are now concerned
with the time evolution of $I$.
11.4.1. No-arbitrage approach

11.4.1..1 One implied volatility. Consider the time-evolution of a single
implied volatility $I$ at some fixed strike $K$ and maturity date $T$. Schonbucher
[Schonbucher, 1998] models directly its dynamics as

$$dI_t = u_t\,dt + \gamma_t\,dW_t^{(0)} + v_t\,dW_t,$$

where $W$ and $W^{(0)}$ are independent Brownian motions. The spot price has
dynamics

$$dS_t = rS_t\,dt + \sigma_tS_t\,dW_t^{(0)},$$

where $\sigma_t$ is yet to be specified.

Since the discounted call price e~ r ^ T ^ t >C BS {t ) St, It) must be a martingale 
under the pricing measure, we have for all /> the following drift restriction 
on the call price: 

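$$\frac{\partial C^{BS}}{\partial t} + rS\frac{\partial C^{BS}}{\partial S} + u\frac{\partial C^{BS}}{\partial I} + \frac{1}{2}\sigma^2S^2\frac{\partial^2 C^{BS}}{\partial S^2} + \gamma\sigma S\frac{\partial^2 C^{BS}}{\partial S\,\partial I} + \frac{1}{2}v^2\frac{\partial^2 C^{BS}}{\partial I^2} = rC^{BS}.$$

This reduces to a joint restriction on the diffusion coefficients of $I$, the drift of
$I$, and the instantaneous volatility $\sigma$:

$$Iu = \frac{I^2 - \sigma^2}{2(T-t)} - \frac{1}{2}d_1d_2v^2 + \frac{d_2}{\sqrt{T-t}}\,\sigma\gamma. \tag{4.1}$$
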
Since S, t, and T are observable, we have that the volatility of /, together 
with the drift of I, determines the spot volatility. Other papers [Brace, Goldys, 
Klebaner & Womersley, 2000; Ledoit, Santa-Clara & Yan, 2001] have arrived 
at analogous results in which one fixes not (strike, expiry), but instead some 
other specification of exactly which implied volatility is to be modelled, such 
as (moneyness, time to maturity). 

Schonbucher imposes a further constraint to ensure that $I$ does not blow up
as $t \to T$. He requires that

$$(I^2 - \sigma^2) - d_1d_2(T-t)v^2 + 2d_2\sqrt{T-t}\,\sigma\gamma = O(T-t), \qquad t \to T, \tag{4.2}$$

which simplifies to

$$I^2\sigma^2 + 2\gamma xI\sigma - I^4 + x^2v^2 = 0.$$

This can be solved to get expiration-date implied volatility in terms of
expiration-date spot volatility. The solution is particularly simple in the zero-correlation
case, where $\gamma = 0$. Then, suppressing subscripts $T$,

$$I^2 = \frac{\sigma^2}{2} + \sqrt{\frac{\sigma^4}{4} + x^2v^2}.$$

Under condition (4.2), therefore, implied volatility behaves as $\sigma + O(x^2)$ for
$x$ small, but $O(|x|^{1/2})$ for $x$ large. Both limits are consistent with the statics of
sections 11.3.1..2 and 11.3.3..1.
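
The zero-correlation solution and its two limits can be checked numerically. The sketch below treats the expiration-date condition as a quadratic equation in the square of implied volatility and takes its positive root; the function names and the parameter values are hypothetical.

```python
import math


def expiry_implied_vol(x, spot_vol, vol_of_implied):
    """Positive root I of I^4 - spot_vol^2 * I^2 - x^2 * vol_of_implied^2 = 0."""
    half = 0.5 * spot_vol**2
    return math.sqrt(half + math.sqrt(half * half + (x * vol_of_implied) ** 2))


def expiry_condition_residual(implied, x, spot_vol, vol_of_implied, gamma=0.0):
    """Left-hand side of the simplified no-explosion condition."""
    return (implied**2 * spot_vol**2 + 2.0 * gamma * x * implied * spot_vol
            - implied**4 + x**2 * vol_of_implied**2)


if __name__ == "__main__":
    for x in (0.0, 0.01, 0.1, 1.0, 10.0):
        print(x, expiry_implied_vol(x, 0.20, 0.30))
```

Near x = 0 the output stays close to the spot volatility, while for large |x| its square grows like |x| times the volatility of implied volatility, in line with the square-root tail growth of section 11.3.1..2.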

APPLICATION 11.4.1 Schonbucher applies this model to the pricing of other
derivatives as follows. Subject to condition (4.2), the modeller specifies the
drift and volatility of $I$, and infers the dependence of instantaneous volatility $\sigma$
on the state variables $(S, t, I)$ according to (4.1). Then the price $C(S, t, I)$ of
a non-strongly-path-dependent derivative satisfies the usual two-factor pricing
equation

$$\frac{\partial C}{\partial t} + rS\frac{\partial C}{\partial S} + u\frac{\partial C}{\partial I} + \frac{1}{2}\sigma^2S^2\frac{\partial^2 C}{\partial S^2} + \gamma\sigma S\frac{\partial^2 C}{\partial S\,\partial I} + \frac{1}{2}v^2\frac{\partial^2 C}{\partial I^2} = rC$$

with boundary conditions depending on the particular contract. Finite
difference methods can solve such a PDE.

Care should be taken to ensure that I does not become negative. 

11.4.1..2 Term structure of implied volatility. Schonbucher extends
this model to $M$ different maturities. The implied volatilities to be modelled are
$I_t(K_m, T_m)$ for $m = 1, \ldots, M$, where $T_1 < T_2 < \cdots < T_M$. Let

$$V_t^{(m)} := I_t^2(K_m, T_m)$$

be the implied variance. One specifies the dynamics for the shortest-dated
variance $V^{(1)}$, as well as all "forward" variances

$$V^{(m,m+1)} := \frac{(T_{m+1} - t)V^{(m+1)} - (T_m - t)V^{(m)}}{T_{m+1} - T_m}.$$

The spot volatility $\sigma_t$ and the drift and diffusion coefficients of $V^{(1)}$ are jointly
subject to the drift restriction (4.1) and the no-explosion condition (4.2). Then,
given the $\sigma_t$ and $V_t^{(1)}$ dynamics, specifying each $V^{(m,m+1)}$ diffusion
coefficient determines the corresponding drift coefficient, by applying (4.1) to $V^{(m+1)}$.

APPLICATION 11.4.2 To price exotic contracts under these multi-factor
dynamics, Schonbucher recommends Monte Carlo simulation of the spot price
(which depends on simulation of implied volatilities). Upon expiry of the $T_1$
option, the $T_2$ option becomes the "front" contract; at that time $V^{(2)}$ coincides
with $V^{(1,2)}$, and at later times its evolution is linked to spot volatility via the
drift and the no-explosion conditions. Similar transitions occur at each later
expiry.

Care should be taken to avoid negative forward variances. 
11.4.2. Statistical approach

Direct modelling of arbitrage-free evolution of an entire implied volatility 
surface remains largely unresolved. Unlike traditional models of spot dynam- 
ics, direct implied volatility models face increasing difficulty in enforcing no- 
arbitrage conditions, when multiple strikes are introduced at a maturity. 

Instead of demanding no-arbitrage, the modeller may have a goal more sta- 
tistical in nature, namely to describe the empirical movements of the implied 
volatility surface. According to Cont and da Fonseca's [Cont & da Fonseca, 
2002] analysis of SP500 and FTSE data, the empirical features of implied 
volatility include the following: 

Three principal components explain most of the daily variations in implied
volatility: one eigenmode reflecting an overall (parallel) shift in the level,
another eigenmode reflecting opposite movements (skew) in low and high strike
volatilities, and a third eigenmode reflecting convexity changes. Variations of
implied volatility along each principal component are autocorrelated,
mean-reverting, and correlated with the underlying.

To quantify these features, Cont and da Fonseca introduce and estimate a
$d$-factor model of the volatility surface, viewed as a function of moneyness $m$
and time-to-maturity $\tau$. The following model is specified under the statistical
probability measure:

$$\log I_t(m,\tau) = \log I_0(m,\tau) + \sum_{k=1}^{d} y_t^{(k)}f^{(k)}(m,\tau),$$

where the eigenmodes $f^{(k)}$, such as the three described above, can be estimated
by principal component analysis; the coefficients $y^{(k)}$ are specified as
mean-reverting Ornstein-Uhlenbeck processes

$$dy_t^{(k)} = -\lambda^{(k)}\big(y_t^{(k)} - \bar y^{(k)}\big)\,dt + v^{(k)}\,dW_t^{(k)}.$$
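
A minimal simulation sketch of such a factor model follows. The three mode shapes (level, skew, convexity), the flat initial surface, and all parameter values are hypothetical stand-ins for quantities that would be estimated from data; the Ornstein-Uhlenbeck factors are advanced with their exact one-step transition, which is a choice made for this sketch.

```python
import math
import random


def simulate_ou(y0, mean, speed, vol, dt, n_steps, rng):
    """Simulate dy = -speed*(y - mean) dt + vol dW using the exact transition."""
    decay = math.exp(-speed * dt)
    step_sd = vol * math.sqrt((1.0 - decay * decay) / (2.0 * speed))
    path = [y0]
    for _ in range(n_steps):
        path.append(mean + (path[-1] - mean) * decay + step_sd * rng.gauss(0.0, 1.0))
    return path


def log_surface(log_i0, factors, modes, m, tau):
    """log I_t(m, tau) = log I_0(m, tau) + sum over k of y_k * f_k(m, tau)."""
    return log_i0(m, tau) + sum(y * f(m, tau) for y, f in zip(factors, modes))


# Hypothetical eigenmodes: parallel shift, skew, and convexity in moneyness m.
MODES = [lambda m, tau: 1.0, lambda m, tau: m, lambda m, tau: m * m]


def flat_log_i0(m, tau):
    """Hypothetical initial surface: 20% implied volatility everywhere."""
    return math.log(0.20)


if __name__ == "__main__":
    rng = random.Random(0)
    paths = [simulate_ou(0.0, 0.0, 2.0, 0.3, 1.0 / 252.0, 252, rng) for _ in MODES]
    final_factors = [p[-1] for p in paths]
    for m in (-0.2, 0.0, 0.2):
        print(m, math.exp(log_surface(flat_log_i0, final_factors, MODES, m, 0.5)))
```

Working with the logarithm of implied volatility keeps every simulated surface positive, which is convenient for the risk-management use mentioned below in the text.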

REMARK 11.4.3 If one takes $y_t^{(k)} = 0$ for all $k$, then $I_t(m,\tau)$ does not vary
in time. This corresponds to an ad-hoc model known to practitioners as "sticky
delta." Balland [Balland, 2002] proves that if the dynamics of $S$ are consistent
with such a model (or even a generalized sticky delta model in which $I_t(m,\tau)$
is time-varying but deterministic), then assuming no arbitrage, $S$ must be the
exponential of a process with independent increments.

APPLICATION 11.4.4 A natural application is the Monte Carlo simulation 
of implied volatility, for the purpose of risk management. 

However, this model, unlike the theory of section 11.4.1., is not intended 
to determine the consistent volatility drifts needed for martingale pricing of 
exotic derivatives. How best to introduce the ideas from this model into a no-
arbitrage theory remains an open question.

Abstract

Let $\{W_{st} : s,t \in [0,1]\}$ be the Brownian sheet. We define the regularized
process $W^{\varepsilon}_{st}$ as the convolution of $W_{st}$ and
$\varphi_\varepsilon(s,t) = \varepsilon^{-2}\varphi(s/\varepsilon)\varphi(t/\varepsilon)$, where $\varphi$
is a function satisfying some conditions. For $\omega$ fixed we prove that

$$\lambda\Bigl(\Bigl\{(s,t)\in[0,1]\times[0,1] : \frac{\varepsilon}{\|\varphi\|_2^2}\,\frac{\partial^2 W^{\varepsilon}_{st}}{\partial s\,\partial t} \le x\Bigr\}\Bigr) \to \Phi(x)$$

almost surely, where $\lambda$ is the Lebesgue measure in $\mathbb{R}^2$, $\Phi$ is the standard
Gaussian distribution and $\|\cdot\|_2$ is the usual norm in $L^2([-1,1],dx)$. These results
are generalized to two-parameter martingales $M$ given by stochastic integrals
of the Cairoli & Walsh type. Finally, as a consequence of our method we also
obtain similar results for the normalized double increment of the processes $W$
and $M$. These results constitute a generalisation of those obtained by Wschebor
for Brownian stochastic integrals.

Keywords: Wiener process, Brownian sheet, double increment 



12.1 Introduction



Several recent works have been devoted to the problem of estimating a process
$\{X_t\}$ when one observes the process at discrete times, i.e. $X_{k/n}$, or when
the observation is the smoothed process
$X^{\varepsilon}_t = \int_{-\infty}^{\infty} \frac{1}{\varepsilon}\varphi\bigl(\frac{t-s}{\varepsilon}\bigr) X_s\,ds$,
where $\varphi$ is a smooth kernel and $\varepsilon$ is a window parameter. In each case the
asymptotic behavior of the estimators is established when the step of observation
$1/n$ or the window $\varepsilon$, respectively, tends to zero. Problems of this type
are important when the observation device allows the resolution to be improved.
The case where the observed process is a Brownian diffusion has been studied
by Genon-Catalot and Jacod in the discrete case and in Wschebor and Perera
and Wschebor in the other one. In this work we consider the same type of
problem when we observe a regularization by convolution of a random field
which is the solution of a stochastic differential equation driven by a Brownian
sheet or, more generally, a Cairoli-Walsh stochastic integral with respect to the
Brownian sheet. We restrict our study to results of law-of-large-numbers type; the
CLT will be considered elsewhere.

Let us introduce the problem. Wschebor has shown that, for almost every
$\omega$, the increments of the Wiener process $B = \{B_t, t \in [0,1]\}$ as a function
of time converge in distribution towards a standard Gaussian distribution $\Phi$.
Namely, he proved that if $\Delta_\varepsilon(t) = \varepsilon^{-1/2}(B_{t+\varepsilon} - B_t)$ denotes the normalized
increments of such a process and $m$ is the Lebesgue measure in $\mathbb{R}$ then almost
surely $m(\{t \in I : \Delta_\varepsilon(t) \le x\}) \to m(I)\Phi(x)$ when $\varepsilon \to 0$, where $I$ is any
interval in $[0,1]$ and $x \in \mathbb{R}$. Moreover he defined the process $B^{\varepsilon}_t = (B * \varphi_\varepsilon)(t)$
as the convolution of $B_t$ and $\varphi_\varepsilon(t) = \varepsilon^{-1}\varphi(t/\varepsilon)$, a convolution kernel that
approaches Dirac's delta function as $\varepsilon \to 0$, and he showed that almost surely

$$m\Bigl(\Bigl\{t \in I : \frac{\sqrt{\varepsilon}}{\|\varphi\|_2}\,\frac{dB^{\varepsilon}_t}{dt} \le x\Bigr\}\Bigr) \to m(I)\Phi(x) \quad \text{when } \varepsilon \to 0,$$

where $\|\cdot\|_2$ is the usual norm of $L^2([-1,1],dm)$. By taking $\varphi(x) = 1_{[-1,0]}(x)$
the result for the normalized increments is a particular case of this. Finally,
Wschebor generalized these results to the class of stochastic processes $N$ given by
$N_t = \int_0^t \psi_s\,dB_s$, where $\psi$ satisfies certain regularity conditions, and obtained
that almost surely,

$$\lim_{\varepsilon \to 0} m\Bigl(\Bigl\{t \in I : \frac{\sqrt{\varepsilon}}{\|\varphi\|_2}\,\frac{dN^{\varepsilon}_t}{dt} \le x\Bigr\}\Bigr) = \int_I F_s(x)\,ds,$$

where $N^{\varepsilon}_t$ denotes the convolution of $N_t$ and $\varphi_\varepsilon(t)$ and $F_s$ is the distribution
function of a centered normal variable with random variance equal to $\psi_s^2$.

In this article we follow Wschebor's method to generalize the above results
to the case of the Brownian sheet instead of the Brownian motion and to the
case of strong martingales, i.e., stochastic processes $M$ given by

$$M_{st} = \int_0^s\!\!\int_0^t \Psi_{uv}\,dW_{uv} \tag{1.1}$$

where the integral considered is the stochastic integral of Cairoli and Walsh
instead of the stochastic integral of Itô type. Note however that in this case the
procedure is a little more involved due to the two-dimensional nature of the time
parameter.

These results are interesting because they give a way to obtain nonparametric
estimators of the coefficient $a$ for two-parameter stochastic differential
equations:

$$dX_{st} = a(X_{st})\,dW_{st} + b(X_{st})\,ds\,dt.$$

These models have been studied, for example, in Carmona and Nualart. We
apply our results to this case; see the Remark following Theorem 6 and
Corollary 3 below.

12.2 Assumptions and Notations 

1 On the process $W$: $\{W_{st} : s,t \in [0,1]\}$ is a Brownian sheet. In what
follows, we shall suppose that $W_{st}$ is defined for all $s,t \in \mathbb{R}$ setting
$W_{st} = 0$ if $s \notin [0,1]$ or $t \notin [0,1]$.

For a rectangle $\Delta = (s,s'] \times (t,t']$, $W(\Delta)$ will denote the double
increment over $\Delta$, i.e. $W(\Delta) = W_{s't'} - W_{st'} - W_{s't} + W_{st}$.

2 On the process $M$: $\{M_{st} : s,t \in [0,1]\}$ is a two-parameter strong
martingale given by (1.1) where $\Psi$ is a process satisfying the conditions of
Cairoli and Walsh for this kind of integral. Also we suppose that $M_{st}$ is
defined for all $s,t \in \mathbb{R}$ setting $M_{st} = 0$ if $s \notin [0,1]$ or $t \notin [0,1]$.
Finally, $M(\Delta)$ will denote the double increment of $M$ over the rectangle
$\Delta$ defined as before.

3 On the kernel $\varphi$: $\operatorname{supp}\varphi \subset [-1,1]$, $\varphi$ is the distribution function of a
(signed) measure $d\varphi(x)$ which has bounded total variation and
$\int_{-1}^{1} \varphi(x)\,dx = 1$.

Throughout the paper we shall consider $W^{\varepsilon}_{st}$ and $M^{\varepsilon}_{st}$, the regularizations by
convolution of $\varphi_\varepsilon(s,t) = \varepsilon^{-2}\varphi(s/\varepsilon)\varphi(t/\varepsilon)$ with $W_{st}$ and $M_{st}$
respectively, and

$$Z^{\varepsilon}(s,t) = \frac{\varepsilon}{\|\varphi\|_2^2}\,\frac{\partial^2 W^{\varepsilon}_{st}}{\partial s\,\partial t}, \quad \text{where} \quad \frac{\partial^2 W^{\varepsilon}_{st}}{\partial s\,\partial t} = \frac{1}{\varepsilon^2}\iint W_{s-\varepsilon u,\,t-\varepsilon v}\,d\varphi(u)\,d\varphi(v).$$

Note that $Z^{\varepsilon}(s,t)$ has standard normal distribution for each $s,t \in [\varepsilon, 1-\varepsilon]$ and
that $Z^{\varepsilon}(s,t) = 0$ if $s,t \notin [\varepsilon, 1-\varepsilon]$. Finally, $\xi$ will denote a standard normal
variable, $C$ shall stand for a generic constant whose value may change during a proof,
$\Delta_\varepsilon(s,t)$ will denote the square $(s, s+\varepsilon] \times (t, t+\varepsilon]$ and $I$, $J$ will be arbitrary
intervals in $[0,1]$.

12.3 Results 

Theorem 4 If $\varepsilon \to 0$, then

$$\lambda(\{(s,t) \in I \times J : Z^{\varepsilon}(s,t) \le x\}) \to \lambda(I \times J)\,\Phi(x)$$

almost surely for all $x \in \mathbb{R}$.

Remark. Taking $\varphi(x) = 1_{[-1,0]}(x)$ we have that $Z^{\varepsilon}(s,t) = W(\Delta_\varepsilon(s,t))/\varepsilon$, so
Theorem 4 holds for the normalized double increment of the Brownian sheet
over $\Delta_\varepsilon(s,t)$.

COROLLARY 2 If $f, g : \mathbb{R} \to \mathbb{R}$ are continuous functions and $g$ satisfies
$E|g(\xi)|^{1+\delta} < C$ then

$$\int_0^1\!\!\int_0^1 f(W^{\varepsilon}_{st})\,g(Z^{\varepsilon}(s,t))\,dt\,ds \to E(g(\xi)) \int_0^1\!\!\int_0^1 f(W_{st})\,dt\,ds$$

a.s. when $\varepsilon \to 0$.
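
The occupation-measure statement for the normalized double increment can be checked numerically on a single simulated sheet. The sketch below is only an illustration at one fixed small window on a discrete grid, not a substitute for the almost-sure limit; the grid size `n`, the window width in cells `m`, the level `x` and the function names are assumptions made for the example.

```python
import math
import random

def normal_cdf(x):
    """Standard normal distribution function Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def occupation_fraction(n, m, x, rng):
    """Fraction of grid squares of side eps = m / n whose normalized double
    increment W(Delta_eps(s, t)) / eps is <= x, for one simulated sheet."""
    h = 1.0 / n
    # Increments of the sheet over grid cells are i.i.d. N(0, h^2);
    # sheet[i][j] accumulates them into the sheet value W(i * h, j * h).
    sheet = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        row_sum = 0.0
        for j in range(1, n + 1):
            row_sum += rng.gauss(0.0, h)
            sheet[i][j] = sheet[i - 1][j] + row_sum
    eps = m * h
    count = 0
    total = 0
    for i in range(n - m + 1):
        for j in range(n - m + 1):
            # Double increment of the sheet over the square with corner (i*h, j*h).
            incr = (sheet[i + m][j + m] - sheet[i][j + m]
                    - sheet[i + m][j] + sheet[i][j])
            total += 1
            if incr / eps <= x:
                count += 1
    return count / total

rng = random.Random(12345)
fraction = occupation_fraction(n=200, m=4, x=0.5, rng=rng)
target = normal_cdf(0.5)
```

For a single path, `fraction` should already be close to `target`; shrinking `m / n` while refining the grid tightens the agreement.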

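Theorem 5 Let $\Gamma_\Psi(\varepsilon) = \sup\{|\Psi_{st} - \Psi_{s't'}| : |s-s'| < \varepsilon,\ |t-t'| < \varepsilon\}$. If
$\Gamma_\Psi^2(\varepsilon)\,\log\log(1/\varepsilon^2)$ tends to zero a.s. when $\varepsilon \to 0$ then

$$\lambda\Bigl(\Bigl\{(s,t) \in I \times J : \frac{M(\Delta_\varepsilon(s,t))}{\varepsilon} \le x\Bigr\}\Bigr) \to \int_I\!\int_J F_{st}(x)\,ds\,dt$$

a.s. when $\varepsilon \to 0$, where $F_{st}$ is the distribution function of a centered normal
variable with random variance $\Psi_{st}^2$.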

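Theorem 6 Let $\Gamma_j(\varepsilon) = \sup\{E|\Psi_{st} - \Psi_{s't'}|^{2j} : |s-s'| < \varepsilon,\ |t-t'| < \varepsilon\}$.
If $\Psi$ has continuous paths and for each $j$ there exist positive constants $C_j$ and
$\gamma_j$ such that $\Gamma_j(\varepsilon) \le C_j\,\varepsilon^{\gamma_j}$ then

$$\lambda\Bigl(\Bigl\{(s,t) \in I \times J : \frac{\varepsilon}{\|\varphi\|_2^2}\,\frac{\partial^2 M^{\varepsilon}_{st}}{\partial s\,\partial t} \le x\Bigr\}\Bigr) \to \int_I\!\int_J F_{st}(x)\,ds\,dt$$

a.s. when $\varepsilon \to 0$.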

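COROLLARY 3 Under the assumptions of Theorem 6 and Corollary 2,

$$\int_0^1\!\!\int_0^1 f(M^{\varepsilon}_{st})\,g\Bigl(\frac{\varepsilon}{\|\varphi\|_2^2}\,\frac{\partial^2 M^{\varepsilon}_{st}}{\partial s\,\partial t}\Bigr)\,dt\,ds \to \int_0^1\!\!\int_0^1 f(M_{st}) \Bigl[\frac{1}{\sqrt{2\pi}\,|\Psi_{st}|}\int_{-\infty}^{+\infty} g(x)\,e^{-x^2/(2\Psi_{st}^2)}\,dx\Bigr]\,dt\,ds$$

a.s. when $\varepsilon \to 0$.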

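Remark. Theorem 6 can be applied to $X^{\varepsilon}_{st}$, the regularization of the
process that solves the equation

$$dX_{st} = a(X_{st})\,dW_{st} + b(X_{st})\,ds\,dt,$$

obtaining

$$\lambda\Bigl(\Bigl\{(s,t) \in I \times J : \frac{\varepsilon}{\|\varphi\|_2^2}\,\frac{\partial^2 X^{\varepsilon}_{st}}{\partial s\,\partial t} \le x\Bigr\}\Bigr) \to \int_I\!\int_J G_{st}(x)\,ds\,dt,$$

where

$$G_{st}(x) = \frac{1}{\sqrt{2\pi}\,a(X_{st})}\int_{-\infty}^{x} \exp\Bigl(-\frac{u^2}{2a^2(X_{st})}\Bigr)\,du.$$

Moreover, taking $f = 1$ and $g(x) = x^2$, Corollary 3 gives

$$\int_0^1\!\!\int_0^1 \Bigl(\frac{\varepsilon}{\|\varphi\|_2^2}\,\frac{\partial^2 X^{\varepsilon}_{st}}{\partial s\,\partial t}\Bigr)^2 ds\,dt \to \int_0^1\!\!\int_0^1 a^2(X_{st})\,ds\,dt$$

a.s.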

12.4 Proofs 

We can assume for simplicity's sake that $I = J = [0,1]$. The proof for
general intervals can be treated in a similar fashion, with some minor
modifications.

12.4.1. Proof of Theorem 4

First, we observe that it is sufficient to prove the convergence of the moments
of $Z^{\varepsilon}(s,t)$, as a random variable in the time parameters, to the moments of
a standard normal variable, i.e. to prove that
$V_k(\varepsilon) = \int_0^1\!\int_0^1 [Z^{\varepsilon}(s,t)]^k\,ds\,dt$
tends to $E(\xi^k)$ a.s. when $\varepsilon \to 0$, for all $k \ge 1$.

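Computing covariances we can show that $Z^{\varepsilon}(s,t)$ and $Z^{\varepsilon}(s',t')$ are
independent if $|s-s'| > 2\varepsilon$ or $|t-t'| > 2\varepsilon$. Using this fact we can see that

$$\operatorname{var}(V_k(\varepsilon)) = \int_0^1\!\!\int_0^1\!\!\int_0^1\!\!\int_0^1 \operatorname{cov}\bigl([Z^{\varepsilon}(s,t)]^k, [Z^{\varepsilon}(s',t')]^k\bigr)\,dt'\,ds'\,ds\,dt \le C\varepsilon^2,$$

splitting the integrals conveniently. Therefore, if $\varepsilon_\nu = \nu^{-a}$, $a > 1$, the
Borel-Cantelli Lemma implies that $V_k(\varepsilon_\nu) \to E(\xi^k)$ a.s. when $\nu \to +\infty$. To
finish the demonstration we have to show that

$$\sup_{\varepsilon_{\nu+1} \le \varepsilon \le \varepsilon_\nu} |V_k(\varepsilon_\nu) - V_k(\varepsilon)| \to 0$$

when $\nu \to +\infty$.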
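Start with $|V_k(\varepsilon_\nu) - V_k(\varepsilon)| \le J_1 + J_2$ where

$$J_1 = \frac{|\varepsilon_\nu^k - \varepsilon^k|}{\|\varphi\|_2^{2k}} \left| \int_0^1\!\!\int_0^1 \Bigl(\frac{\partial^2 W^{\varepsilon_\nu}_{st}}{\partial s\,\partial t}\Bigr)^k ds\,dt \right|$$

and

$$J_2 = \frac{\varepsilon^k}{\|\varphi\|_2^{2k}} \left| \int_0^1\!\!\int_0^1 \Bigl[\Bigl(\frac{\partial^2 W^{\varepsilon}_{st}}{\partial s\,\partial t}\Bigr)^k - \Bigl(\frac{\partial^2 W^{\varepsilon_\nu}_{st}}{\partial s\,\partial t}\Bigr)^k\Bigr] ds\,dt \right|.$$

As $\varepsilon_{\nu+1} \le \varepsilon \le \varepsilon_\nu$, for the term $J_1$ we have

$$J_1 \le \Bigl|1 - \frac{\varepsilon_{\nu+1}^k}{\varepsilon_\nu^k}\Bigr|\,|V_k(\varepsilon_\nu)| \to 0 \quad \text{a.s.}$$

when $\nu \to +\infty$. Next we define $A(s,t,\varepsilon) = \frac{\partial^2 W^{\varepsilon}_{st}}{\partial s\,\partial t}$ and
$B(s,t,\varepsilon,\varepsilon_\nu) = A(s,t,\varepsilon) - A(s,t,\varepsilon_\nu)$, and using the identity

$$(A + B)^k - A^k = \sum_{j=0}^{k-1} \binom{k}{j} A^j B^{k-j}$$

we obtain that

$$J_2 \le \frac{\varepsilon^k}{\|\varphi\|_2^{2k}} \sum_{j=0}^{k-1} \binom{k}{j} \int_0^1\!\!\int_0^1 |A(s,t,\varepsilon_\nu)|^j\,|B(s,t,\varepsilon,\varepsilon_\nu)|^{k-j}\,ds\,dt.$$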

By the appendix with e„ < s, t < 1 — e„ we conclude that |yl (s,i,£ „)| J < 
£ - j{l+5) Kl and |5 (a,t l e,e 1/ )| fc - J ' < e^Vf "^ (i/) where, for < «J < 
1/2, 

.2(1-5) 



fr a (i/) = 2fr v -^ 



-v+i 



ev+i 



M 



and if^, is a constant dependent on cp. Therefore, 



Ji< 



C ^ e k . d 



: 2fc 



E^^'^'V) 



IMir^ e H-i d 



For $\delta$ small enough, $\varepsilon_\nu^{-j\delta} H_\delta^{k-j}(\nu) \to 0$ when $\nu \to +\infty$ for all $0 \le j \le k-1$.
So $J_2$ tends to zero when $\nu \to +\infty$ and this completes the proof of Theorem 4.



12.4.2. Proof of Corollary 2

Note that

$$\left| \int_0^1\!\!\int_0^1 f(W^{\varepsilon}_{st})\,g(Z^{\varepsilon}(s,t))\,ds\,dt - E(g(\xi)) \int_0^1\!\!\int_0^1 f(W_{st})\,ds\,dt \right| \le Q_1 + Q_2$$

where

$$Q_1 = \sup_{s,t \in [0,1]} |f(W^{\varepsilon}_{st}) - f(W_{st})| \int_0^1\!\!\int_0^1 |g(Z^{\varepsilon}(s,t))|\,ds\,dt$$

and

$$Q_2 = \left| \int_0^1\!\!\int_0^1 f(W_{st})\,g(Z^{\varepsilon}(s,t))\,ds\,dt - E(g(\xi)) \int_0^1\!\!\int_0^1 f(W_{st})\,ds\,dt \right|.$$

Using the Dominated Convergence Theorem we have that $\lim_{\varepsilon \to 0} W^{\varepsilon}_{st} = W_{st}$.
Therefore, the continuity of $f$ and the boundedness of $W^{\varepsilon}_{st}$ and $W_{st}$ imply that

$$\sup_{s,t \in [0,1]} |f(W^{\varepsilon}_{st}) - f(W_{st})| \to 0$$

when $\varepsilon \to 0$. To see that $Q_1$ tends to zero when $\varepsilon \to 0$ it is sufficient to observe
that by Theorem 4 and the assumption on $g$ we have that

$$\int_0^1\!\!\int_0^1 |g(Z^{\varepsilon}(s,t))|\,ds\,dt \to E|g(\xi)|$$

when $\varepsilon \to 0$. The convergence to zero of $Q_2$ can be obtained from Theorem 4
by a standard approximation argument.

Remark. Following the proof of Corollary 2 we can show that

$$\int_0^s\!\!\int_0^t uv\,f(W^{\varepsilon}_{uv})\,g(Z^{\varepsilon}(u,v))\,dv\,du \to E(g(\xi)) \int_0^s\!\!\int_0^t uv\,f(W_{uv})\,dv\,du$$

a.s. when $\varepsilon \to 0$. Therefore, if $f$ is bounded we have by Theorem 6.1 of Cairoli
and Walsh [1] that there exists a process $\{\phi(x;s,t) : x \in \mathbb{R},\ s,t \in [0,1]\}$ which
is a.s. jointly continuous in $x$, $s$ and $t$ such that

$$\int_0^s\!\!\int_0^t uv\,f(W_{uv})\,dv\,du = \int_{-\infty}^{+\infty} \phi(x;s,t)\,f(x)\,dx$$

almost surely. Hence, we obtain an a.s. approximation of this kind of local
time for the Brownian sheet.

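12.4.3. Proof of Theorem 5

We have

$$\frac{M(\Delta_\varepsilon(s,t))}{\varepsilon} = \Psi_{st}\,\frac{W(\Delta_\varepsilon(s,t))}{\varepsilon} + \frac{1}{\varepsilon} \int_s^{s+\varepsilon}\!\!\int_t^{t+\varepsilon} (\Psi_{s't'} - \Psi_{st})\,dW_{s't'}. \tag{4.1}$$

Using Theorem 4 and the remark at the end of it, we have that

$$\lim_{\varepsilon \to 0} \lambda\Bigl(\Bigl\{(s,t) \in [0,1] \times [0,1] : \Psi_{st}\,\frac{W(\Delta_\varepsilon(s,t))}{\varepsilon} \le x\Bigr\}\Bigr) = \int_0^1\!\!\int_0^1 F_{st}(x)\,ds\,dt$$

a.s. Hence, if we denote by $Q^{\varepsilon}_{st}$ the second term in the right-hand side of (4.1),
it is enough to prove that $Q^{\varepsilon}_{st}$ tends to zero when $\varepsilon \to 0$ for almost all $\omega$ and
$s,t \in [0,1]$ to finish the proof of Theorem 5.

First, notice that $U^{\varepsilon}_{st} = \varepsilon Q^{\varepsilon}_{st}$ is a stochastic integral in the plane. Therefore, it
is a martingale and an $i$-martingale, $i = 1,2$ (see Cairoli and Walsh). Hence,
for fixed $s,t \in [0,1]$, $\{U^{\varepsilon}_{st}, \varepsilon > 0\}$ is a martingale with increasing process

$$\int_s^{s+\varepsilon}\!\!\int_t^{t+\varepsilon} [\Psi_{s't'} - \Psi_{st}]^2\,dt'\,ds' \le \varepsilon^2\,\Gamma_\Psi^2(\varepsilon).$$

Therefore, using the time change theorem, the law of the iterated logarithm and
our assumptions we have the desired result.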

12.4.4. Proof of Theorem 6 

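As in the proof of Theorem 4, it is sufficient to show the a.s. convergence of
moments of order $k \ge 1$ to $E(\xi^k) \int_0^1\!\int_0^1 \Psi_{st}^k\,ds\,dt$.

Using the differentiation formulas of pages 224 and 226 of Farré and Nualart
with $\varphi_\varepsilon(s) = \varepsilon^{-1}\varphi(s/\varepsilon)$ we obtain that

$$\frac{\partial^2 M^{\varepsilon}_{st}}{\partial s\,\partial t} = \Psi_{st}\,\frac{\partial^2 W^{\varepsilon}_{st}}{\partial s\,\partial t} + \int_{\mathbb{R}^2} \varphi_\varepsilon(s-s')\,\varphi_\varepsilon(t-t')\,[\Psi_{s't'} - \Psi_{st}]\,dW_{s't'}.$$

Taking

$$I^{\varepsilon}(s,t) = \frac{\varepsilon}{\|\varphi\|_2^2}\,\Psi_{st}\,\frac{\partial^2 W^{\varepsilon}_{st}}{\partial s\,\partial t}$$

and

$$J^{\varepsilon}(s,t) = \frac{\varepsilon}{\|\varphi\|_2^2} \int_{\mathbb{R}^2} \varphi_\varepsilon(s-s')\,\varphi_\varepsilon(t-t')\,[\Psi_{s't'} - \Psi_{st}]\,dW_{s't'}$$

we have that

$$\int_0^1\!\!\int_0^1 \Bigl(\frac{\varepsilon}{\|\varphi\|_2^2}\,\frac{\partial^2 M^{\varepsilon}_{st}}{\partial s\,\partial t}\Bigr)^k ds\,dt = \sum_{j=0}^{k-1} \binom{k}{j}\,U^{\varepsilon}(j, k-j) + U^{\varepsilon}(k, 0) \tag{4.2}$$

with $U^{\varepsilon}(i,j) = \int_0^1\!\int_0^1 I^{\varepsilon}(s,t)^i\,J^{\varepsilon}(s,t)^j\,ds\,dt$.

Theorem 4 implies that $U^{\varepsilon}(k,0) \to E(\xi^k) \int_0^1\!\int_0^1 \Psi_{st}^k\,ds\,dt$ a.s. when $\varepsilon \to 0$.
So, to finish the proof we have to show that the first term in the right-hand side
of (4.2) tends to zero a.s. when $\varepsilon \to 0$. Using the inequality from Theorem 2.1
of Guyon and Prum we have that for any positive integer $h$

$$E|J^{\varepsilon}(s,t)|^{2h} \le C\,\frac{\varepsilon^{4h-2}}{\|\varphi\|_2^{4h}} \int_{s-\varepsilon}^{s+\varepsilon}\!\!\int_{t-\varepsilon}^{t+\varepsilon} \varphi_\varepsilon^{2h}(s-s')\,\varphi_\varepsilon^{2h}(t-t')\,E[\Psi_{s't'} - \Psi_{st}]^{2h}\,dt'\,ds'.$$

Because of the hypothesis on $\Gamma$ we obtain that $E|J^{\varepsilon}(s,t)|^{2h} \le C\,C_h\,\varepsilon^{\gamma_h}$.
Therefore

$$E\Bigl(\int_0^1\!\!\int_0^1 |J^{\varepsilon}(s,t)|^{2h}\,ds\,dt\Bigr) \le C\,C_h\,\varepsilon^{\gamma_h}.$$

Using the Borel-Cantelli Lemma with $\varepsilon_\nu = \nu^{-a}$ as in Theorem 4 we obtain
that $U^{\varepsilon_\nu}(0,2h) \to 0$ a.s. when $\nu \to +\infty$ for any positive integer $h$. So,

$$U^{\varepsilon_\nu}(j, k-j)^2 \le U^{\varepsilon_\nu}(2j, 0)\,U^{\varepsilon_\nu}(0, 2(k-j)) \to 0$$

almost surely when $\nu \to +\infty$. Hence

$$\int_0^1\!\!\int_0^1 \Bigl(\frac{\varepsilon_\nu}{\|\varphi\|_2^2}\,\frac{\partial^2 M^{\varepsilon_\nu}_{st}}{\partial s\,\partial t}\Bigr)^k ds\,dt \to E(\xi^k) \int_0^1\!\!\int_0^1 \Psi_{st}^k\,ds\,dt$$

a.s. when $\nu \to +\infty$.

Finally, we can obtain analogous results to those of the appendix for the process
$M$, and proceeding as in Theorem 4 we have the result.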

Appendix 

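In this appendix, we show how to obtain the bounds for the terms $|A(s,t,\varepsilon_\nu)|$ and
$|B(s,t,\varepsilon,\varepsilon_\nu)|$ used in the proof of Theorem 4.

Recall that $A(s,t,\varepsilon) = \frac{\partial^2 W^{\varepsilon}_{st}}{\partial s\,\partial t}$ and
$B(s,t,\varepsilon,\varepsilon_\nu) = A(s,t,\varepsilon) - A(s,t,\varepsilon_\nu)$. Using
$\int_{-\infty}^{+\infty} d\varphi(x) = 0$ we have that

$$\varepsilon A(s,t,\varepsilon) = \frac{1}{\varepsilon} \int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} W(D_\varepsilon(s,t,u,v))\,d\varphi(u)\,d\varphi(v)$$

where $\varepsilon \le s,t \le 1-\varepsilon$ and $D_\varepsilon(s,t,u,v) = (s-\varepsilon, s-\varepsilon u] \times (t-\varepsilon, t-\varepsilon v]$. So, because
of the modulus of continuity of the Brownian sheet (see Csörgő and Révész or Orey and Pruitt)
and the hypothesis on $\varphi$ we obtain that

$$|\varepsilon A(s,t,\varepsilon)| \le \frac{C}{\varepsilon}\,\varepsilon^{1-\delta} \Bigl(\int_{-\infty}^{+\infty} d|\varphi|(x)\Bigr)^2 = \varepsilon^{-\delta} K_\varphi$$

with $|\varphi|$ the total variation of $\varphi$. Thus

$$|A(s,t,\varepsilon)| \le \varepsilon^{-(1+\delta)} K_\varphi. \tag{A.1}$$

Regarding the other term, we observe first that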



\eB(s,t,e,e v )\ < 



d 2 W! t _ d 2 W t 



2 U/ S " 
st 



T It-i, 



dsdt dsdt ' e v T" dsdt 

the second term in the right hand side of the above equation can be bounded, using (A. 1), by 

£f+l 



l e -g"L -f K < 



l 



eZ* K v 
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and the first by 

-5— / [/i(s,*,u ) u 1 £ ) e„) + / 2 (s,t, «,«,£,£.>)] d\ip\(v)d\ip\{u) 
£ v+i Jr 3 

with 

fi (s,t,u,v,e,e u ) = \e- e u \ \W(D e „ (s,t,u,v))\ 

and 

f 2 {s,t,u,v,e,e u ) = e v \W{De u (s,t,u,v))-W (pctv (s,t,u,v))\ 

where D Eftv (s,t,u,v) = {s — £v,s — eu] x (t — e v ,t — ev] for£„ < s,t < 1 — e u . Using 
again the modulus of continuity of the Brownian sheet, the function /1 can be bounded by 

CI- £k±1 ei ~ ' for < 6 < 1/2. With respect to function fc, it is enough to study 

the shape of the rectangles D tv (s, t, u, v) and D eie „ (s, t, u, v) for distinct values (positive or 
negatives) of u and v and to use the modulus of continuity to obtain that 

\W{D S „ (s,t,u,v)) - W (Z5,,,„ (s,t,u,v))\ < C [e v {eu - £,+i)j i_<S 
Hence, 

eB(s,t,£,e») < K v £^- 1 1 - £*±i | + |l - Sk±i| ^ e"' 

I , |i _{ 2(1-4) 

or equivalently $|B(s,t,\varepsilon,\varepsilon_\nu)| \le \varepsilon^{-1} H_\delta(\nu)$.

Abstract We present Poisson approximation results for additive functionals switched by
Markov and semi-Markov processes. The weak convergence results are obtained
via semimartingale representations of additive functionals and the convergence
of generators for Markov processes and of the compensative operator of
the extended Markov renewal processes. This is a review paper of our previous
results given in [Korolyuk, 2002; Korolyuk, 2002A].

Keywords: Additive functional, Poisson approximation, Compound Poisson approximation 
with drift, Markov, semi-Markov switching, semimartingale, compensative op- 
erator, extended Markov renewal process. 

13.1 Introduction 

Poisson approximation is a very active research field [Aldous, 1989; Barbour,
1992; Barbour, 2002]. Three kinds of Poisson process approximation
exist: standard Poisson process [Aldous, 1989; Barbour, 1992], compound
Poisson process [Barbour, 2002], and compound Poisson process with drift
[Korolyuk, 2000; Korolyuk, 2001A; Korolyuk, 2002].

A compound Poisson process with drift (CPPD) is defined as follows

$$\xi(t) = at + \sum_{k=1}^{\nu(t)} \alpha_k, \quad t \ge 0, \tag{1.1}$$

where $(\alpha_k)$ is a real i.i.d. sequence, $\nu(t)$, $t \ge 0$, is a time-homogeneous Poisson
process and $a \in \mathbb{R}$.

The results we present here are a review of our previous results [Korolyuk,
2000; Korolyuk, 2001A; Korolyuk, 2002; Korolyuk, 2002A], and concern
approximation of additive functionals by CPPDs like (1.1).
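
To make the definition of a CPPD concrete, here is a minimal simulation sketch of (1.1) at a fixed horizon. The function name, the jump sampler and the parameter values are illustrative assumptions; the Poisson process is generated from exponential inter-arrival times.

```python
import random

def simulate_cppd(drift, intensity, jump_sampler, horizon, rng):
    """Return (xi(horizon), nu(horizon)) for the compound Poisson process with
    drift xi(t) = drift * t + sum of the first nu(t) jumps (intensity > 0)."""
    time = rng.expovariate(intensity)  # first arrival of the Poisson process
    total_jumps = 0.0
    n_jumps = 0
    while time <= horizon:
        total_jumps += jump_sampler(rng)
        n_jumps += 1
        time += rng.expovariate(intensity)
    return drift * horizon + total_jumps, n_jumps

# Hypothetical example: drift a = -2, intensity 0.5, jumps of constant size 1.
value, n_jumps = simulate_cppd(-2.0, 0.5, lambda r: 1.0, 10.0, random.Random(7))
```

With constant unit jumps the returned value equals `drift * horizon + n_jumps`, which gives a quick sanity check of the bookkeeping before plugging in a random jump law.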

Additive functionals of stochastic processes play an important part in the- 
ory and in many applications [Korolyuk, 1999; Korolyuk, 1999A; Korolyuk, 
1999B; Korolyuk, 2001; Korolyuk, 2000A; Korolyuk, 2000B; Korolyuk, 2000C; 
Korolyuk, 2002]. We have obtained diffusion approximation of additive func- 
tionals with Markov switching with and without balance condition in [Ko- 
rolyuk, 2000A; Korolyuk, 2000B; Korolyuk, 2000C], and Poisson approxi- 
mation for increment processes and their stochastic exponentials with Markov 
switching in [Korolyuk, 2000]. In the above cases, we have worked in the 
settings of the books [Jacod, 1987] and [Ethier, 1986], where the martingale 
characterization is used. We have also obtained results of CPPD approximation
for integral functionals with semi-Markov switching [Korolyuk, 2002]. In
the latter case, because the switching process is semi-Markov, the martingale
characterization no longer works, hence the need for better-adapted tools. In
fact, we make use of the compensative operator for extended Markov renewal
processes, introduced by Wentzel & Sviridenko [Sviridenko, 1989], from which
we derive the martingale characterization.

Consider a sequence of r.v.s $\alpha_k$, $k \ge 0$, and a multivariate point process
$\tau_k, x_k$, $k \ge 1$, [Anisimov, 1995; Borovskikh, 1997; Jacod, 1987; Korolyuk,
1999; Liptser, 1989], with counting process $\nu(t) := \inf\{k \ge 1 : \tau_k \le t\}$. The
stochastic process $\zeta(t)$, $t \ge 0$, defined by

$$\zeta(t) := \sum_{k=1}^{\nu(t)} \alpha_k(\tau_k, x_{k-1}), \quad t \ge 0, \tag{1.2}$$

is called an increment process [Borovskikh, 1997; Jacod, 1987]. We study
the increment process with Markov switching as an additive semimartingale
[Cinlar, 1980]. If the r.v.s $\alpha_k$, $k \ge 1$, are iid and the multivariate point process
$\tau_k, x_k$, $k \ge 0$, $\theta_{k+1} = \tau_{k+1} - \tau_k$, is just a renewal point process on $\mathbb{R}_+$, then
(1.2) is called a compound process or a renewal reward process [Osaki, 1985].
If the r.v.s $\alpha_k$, $k \ge 0$, are iid and the multivariate point process $\tau_k, x_k$, $k \ge 0$,
is just a Poisson point process on $\mathbb{R}_+$, then (1.2) is called a compound
Poisson process [Osaki, 1985]. If $\alpha_k$ is a fixed function defined on $\mathbb{R} \times E$,
then (1.2) is a shot noise process [Parzen, 1999], which plays an important
role in the theory of noise of physical devices. In [Kluppelberg, 1995] the
authors consider the random measure $\alpha_k(\tau_k)$ and a Poisson process $\nu(t)$, and
they derive asymptotic results for (1.2) with application in insurance. For a
semimartingale representation, see [Borovskikh, 1997; Cinlar, 1980; Jacod,
1987; Liptser, 1994; Liptser, 1989; Liptser, 1991].

The additive functionals that we consider are of the following form

$$\xi(t) := \int_0^t \eta(ds; x(s)), \quad t \ge 0, \tag{1.3}$$

where the switched process $\eta(t;x)$, $x \in E$, is a Markov process with locally
independent increments and the switching process $x(t)$ is a semi-Markov
process with state space $E$. This additive functional is a continuous functional.

In fact, an additive functional can also be represented by the sum of an
increment process and of another term, i.e.,

$$\xi(t) = \sum_{k=1}^{\nu(t)-1} \eta_k(\theta_k; x_{k-1}) + \eta(t - \tau_{\nu(t)}; x(t)).$$

Processes of this kind are widely used in applications, e.g., risk and storage
theory [Prabhu, 1980], reliability and maintenance theory [Osaki, 1985],
finance and insurance [Kluppelberg, 1995], noise of physical devices [Parzen,
1999], etc. In applications, $\theta_{k+1}$ is the acting time of the $k$th event and $\alpha_k$ is
its magnitude, its cost, etc.

In many applied problems the r.v.s $\alpha_k$ depend on the environment. For
example, the cost of a damage for an insurance company depends on the
place, time, weather, etc. in which it happens. In the case where we have a
multistate environment, $E$ say, we suppose that the r.v.s $\alpha_k$ depend on the
state $x \in E$, denoted $\alpha_k(x)$.

The increment process considered here is based on a multivariate point process
which corresponds to a Markov renewal representation of a Markov process,
and the r.v.s $\alpha_k$ depend on the states of that Markov process. The convergence
of the increment process towards a compound Poisson process with drift
is due to the fact that we assume the r.v.s $\alpha_k$ take small values with large
probabilities and large values with small probabilities. Small jumps are transformed
into deterministic drift.

In Section 2, we define continuous and discontinuous additive functionals.
In Section 3, we give weak convergence results of the increment processes
towards compound Poisson processes with drift. In Section 4, we consider an
asymptotic split phase space for the switching Markov process and give Poisson
approximation results of the increment processes. In Section 5, we give
Poisson approximation results for an additive functional with semi-Markov
switching process. Finally, in Section 6, we give the main steps of the proofs of the
theorems.




13.2 Preliminaries 

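Let us consider a time-homogeneous càdlàg stochastic process $x(t)$, $t \ge 0$,
with values in a Polish space $(E, \mathcal{E})$. Times $0 = \tau_0 \le \tau_1 \le \cdots$ denote the
jump times and define the embedded process $x_n = x(\tau_n)$, $n \ge 0$.

Let $\eta^{\varepsilon}(t;x)$, $t \ge 0$, $x \in E$, $\varepsilon > 0$, be a family of homogeneous Markov
jump processes in the Euclidean space $\mathbb{R}^d$, $d \ge 1$, defined by the generators

$$\Gamma^{\varepsilon}(x)\varphi(u) = \varepsilon^{-1} \int_{\mathbb{R}^d} [\varphi(u+v) - \varphi(u)]\,\Gamma^{\varepsilon}(dv;x), \quad x \in E. \tag{2.1}$$

The results presented here concern the following additive functionals:

$$\zeta^{\varepsilon}(t) := \sum_{k=1}^{\nu(t/\varepsilon)} \alpha^{\varepsilon}_k(x_k), \quad t \ge 0, \tag{2.2}$$

and

$$\xi^{\varepsilon}(t) := \int_0^t \eta^{\varepsilon}(ds; x(s/\varepsilon)), \quad t \ge 0. \tag{2.3}$$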

In the first case (2.2) we suppose that the process $x(t)$, $t \ge 0$, is a Markov
process, and that it is uniformly ergodic with stationary distribution $\pi(B)$, $B \in \mathcal{E}$.
Thus the embedded Markov chain $x_k$, $k \ge 0$, is uniformly ergodic too, with
stationary distribution $\rho(B)$, $B \in \mathcal{E}$, related by the following relation

$$\pi(dx)\,q(x) = q\,\rho(dx), \quad q := \int_E \pi(dx)\,q(x). \tag{2.4}$$
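
For a finite state space, relation (2.4) can be solved explicitly: $\pi(x) = q\,\rho(x)/q(x)$ with $q = 1/\sum_x \rho(x)/q(x)$. The sketch below computes both quantities, assuming that $q(x)$ denotes the intensity of the exponential sojourn time in state $x$; the function name and the numerical values are illustrative only.

```python
def stationary_from_embedded(rho, rates):
    """Given the embedded-chain stationary law rho and sojourn intensities q(x),
    return (pi, q) such that pi(x) * q(x) = q * rho(x) and sum(pi) = 1."""
    weights = [r / qx for r, qx in zip(rho, rates)]
    q = 1.0 / sum(weights)
    pi = [q * w for w in weights]
    return pi, q

# Hypothetical two-state example with intensities q(1) = 2 and q(2) = 6.
pi, q = stationary_from_embedded([0.5, 0.5], [2.0, 6.0])
```

Here the process spends three times as long in the slow state, so `pi` puts weight 0.75 on it, and `q` is the `pi`-average of the intensities, as the second identity in (2.4) requires.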

In the sequel we will suppose that

$$0 < q_0 \le q(x) \le q_1 < +\infty, \quad x \in E. \tag{2.5}$$

In the second case (2.3), we suppose that $x(t)$, $t \ge 0$, is a semi-Markov
process with semi-Markov kernel

$$Q(x,B,t) = P(x,B)\,G_x(t), \quad x \in E,\ B \in \mathcal{E},\ t \ge 0, \tag{2.6}$$

which defines the associated Markov renewal process $(x_n, \tau_n;\ n \ge 0)$ by:

$$Q(x,B,t) = \mathbb{P}(x_{n+1} \in B,\ \theta_{n+1} \le t \mid x_n = x) = \mathbb{P}(x_{n+1} \in B \mid x_n = x)\,\mathbb{P}(\theta_{n+1} \le t \mid x_n = x). \tag{2.7}$$

The semi-Markov process defined by the semi-Markov kernel (2.6) is a special
case in which $G_x(t)$ does not depend on the next visited state. Nevertheless,
this is not restrictive since any semi-Markov process can be transformed into
the above form, see [Limnios, 2001].

Here $\theta_{n+1} := \tau_{n+1} - \tau_n$, $n \ge 0$, are the sojourn times given by the
distribution functions

$$G_x(t) = \mathbb{P}(\theta_{n+1} \le t \mid x_n = x) =: \mathbb{P}(\theta_x \le t). \tag{2.8}$$

The embedded Markov chain $(x_n, n \ge 0)$ is defined by the stochastic
kernel

$$P(x,B) = \mathbb{P}(x_{n+1} \in B \mid x_n = x). \tag{2.9}$$

We suppose that the semi-Markov process $x(t)$, $t \ge 0$, is regular [Limnios,
2001], that is to say

$$\mathbb{P}_x(\nu(t) < \infty) = 1, \quad \text{for all } x \in E \text{ and } t \in \mathbb{R}_+, \tag{2.10}$$

with the counting process

$$\nu(t) = \max\{n \ge 0 : \tau_n \le t\}. \tag{2.11}$$
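
The counting process (2.11) is straightforward to evaluate from a realized sequence of sojourn times. The following sketch builds the jump times $\tau_n$ by cumulative summation and locates $\nu(t)$ by binary search; the function names and the sample sojourn times are hypothetical.

```python
import bisect
import itertools

def jump_times(sojourn_times):
    """tau_0 = 0 and tau_{n+1} = tau_n + theta_{n+1}."""
    return [0.0] + list(itertools.accumulate(sojourn_times))

def counting_process(taus, t):
    """nu(t) = max{n >= 0 : tau_n <= t}, for t >= 0 and nondecreasing taus."""
    return bisect.bisect_right(taus, t) - 1

# Hypothetical sojourn times theta_1, theta_2, theta_3.
taus = jump_times([1.5, 0.5, 2.0])
```

Because `bisect_right` counts every $\tau_n \le t$, a jump occurring exactly at time $t$ is already included in $\nu(t)$, in line with the right-continuity of the process.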

The additive functional (2.3) can be represented by the sum

$$\xi^{\varepsilon}(t) = \sum_{n=0}^{\nu(t/\varepsilon)-1} \eta^{\varepsilon}_n(\theta_{n+1}; x_n) + \eta^{\varepsilon}(\theta^{\varepsilon}(t); x(t/\varepsilon)), \tag{2.12}$$

where $\theta^{\varepsilon}(t) := t/\varepsilon - \tau(t)$, $\tau(t) := \tau_{\nu(t)}$.

We will present Poisson approximation results for functionals (2.2) and (2.3)
by a semimartingale approach. In both cases, the limit processes are compound
Poisson processes with drift.

The semimartingale approach used here is interesting not only because it
offers a general framework for convergence of stochastic processes but also
because the semimartingale representation of additive functionals is obtained
by using the Poisson approximation conditions for distribution functions of
jumps.

13.3 Increment Process 

Let us introduce the convergence-determining class of functions $C_3(\mathbb{R})$
(see [Jacod, 1987], VII.2.7). This class is characterized by the following
condition: $g \in C_3(\mathbb{R})$ is a real-valued bounded continuous function with
$g(u)/u^2 \to 0$ as $|u| \to 0$.

Let us consider the additive functional $\zeta^{\varepsilon}(t)$ given in (2.3).
Assumptions (A)

(A1:) The switching Markov jump process $x(t)$, $t \ge 0$, is uniformly ergodic
with the stationary distribution (2.4).

(A2:) The family of random variables $\alpha^{\varepsilon}_k(x)$, $k \ge 0$, $x \in E$, is uniformly
square integrable, i.e.,

$$\sup_{\varepsilon > 0}\,\sup_{x \in E} \int_{|u| > c} u^2\,\Phi^{\varepsilon}_x(du) \to 0, \quad \text{as } c \to \infty.$$

(A3:) Approximation of mean value

$$\int_{\mathbb{R}} u\,\Phi^{\varepsilon}_x(du) = \varepsilon[a(x) + \theta^{\varepsilon}(x)],$$

and $\sup_{x \in E} |a(x)| \le a < \infty$.

(A4:) Poisson approximation condition

$$\int_{\mathbb{R}} g(u)\,\Phi^{\varepsilon}_x(du) = \varepsilon[\Phi_x(g) + \theta^{\varepsilon}_g(x)], \quad g \in C_3(\mathbb{R}),$$

and $\sup_{x \in E} |\Phi_x(g)| \le \Phi(g) < \infty$.

(A5:) Square-integrability condition

$$\sup_{x \in E} \int_{\mathbb{R}} u^2\,\Phi_x(du) < +\infty,$$

where the measure $\Phi_x(du)$ is defined by the relation (see [Jacod, 1987])

$$\Phi_x(g) = \int_{\mathbb{R}} g(u)\,\Phi_x(du), \quad g \in C_3(\mathbb{R}).$$

The negligible terms $\theta^{\varepsilon}(x)$ and $\theta^{\varepsilon}_g(x)$ in the above conditions satisfy:

$$\sup_{x \in E} |\theta^{\varepsilon}(x)| \to 0, \quad \sup_{x \in E} |\theta^{\varepsilon}_g(x)| \to 0, \quad \varepsilon \to 0.$$

THEOREM 1 Under Assumptions A1-A5, the increment process (2.2) converges
weakly to the compound Poisson process with drift

$$\zeta^0(t) := \sum_{k=1}^{\nu_0(t)} \alpha^0_k + t\,q\,a_0, \quad t \ge 0. \tag{3.1}$$

The distribution function $\Phi^0(u)$ of the iid random variables $\alpha^0_k$, $k \ge 0$, is
defined on the measure-determining class $C_3(\mathbb{R})$ of functions $g$ by the relation

$$E g(\alpha^0_k) = \int_{\mathbb{R}} g(u)\,\Phi^0(du) = \Phi(g)/\Phi(1), \quad g \in C_3(\mathbb{R}), \tag{3.2}$$

where

$$\Phi(g) := \int_E \rho(dx)\,\Phi_x(g), \quad \Phi(1) := \int_E \rho(dx)\,\Phi_x(1). \tag{3.3}$$

The counting Poisson process $\nu_0(t)$ is defined by the intensity

$$q_0 := q\,\Phi(1). \tag{3.4}$$

The drift parameter $a_0$ is defined by

$$a_0 = \hat{a} - \Phi(1)\,E\alpha^0_1, \quad \hat{a} := \int_E \rho(dx)\,a(x). \tag{3.5}$$

The following corollary concerns the case where the state space $E$ is finite.

COROLLARY 1 The increment process (2.2) with a finite number of jump
values:

$$\mathbb{P}(\alpha^{\varepsilon}_k(x) = a_m) = \varepsilon\,p_m(x), \quad 1 \le m \le M,$$
$$\mathbb{P}(\alpha^{\varepsilon}_k(x) = \varepsilon a_0) = 1 - \varepsilon\,p_0(x), \quad p_0(x) = \sum_{m=1}^{M} p_m(x), \tag{3.6}$$

converges weakly to the compound Poisson process (3.1) determined by the
distribution function of jumps:

$$\mathbb{P}(\alpha^0_k = a_m) = p^0_m, \quad 1 \le m \le M,$$
$$p^0_m = p_m/p_0, \quad p_m = \int_E \rho(dx)\,p_m(x), \quad 1 \le m \le M. \tag{3.7}$$

The intensity of the counting Poisson process $\nu_0(t)$, $t \ge 0$, is defined by

$$q_0 := q\,p_0, \tag{3.8}$$

and the drift parameter $a_0$ is given in (3.6).
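
For a finite state space the limit parameters in (3.7) and (3.8) reduce to weighted averages, which the following sketch computes. It assumes that $p_0$ is the $\rho$-average of $p_0(x)$, i.e. the sum of the averaged weights $p_m$; the function name and all numerical inputs are hypothetical.

```python
def limit_parameters(rho, q, jump_probs):
    """jump_probs[x][m - 1] = p_m(x) for states x and jump values a_1, ..., a_M.
    Returns (limit jump law [p_1^0, ..., p_M^0], Poisson intensity q_0)."""
    n_values = len(jump_probs[0])
    # p_m = sum_x rho(x) * p_m(x): average of the state-dependent weights.
    p = [sum(r * probs[m] for r, probs in zip(rho, jump_probs))
         for m in range(n_values)]
    p0 = sum(p)
    return [pm / p0 for pm in p], q * p0

# Hypothetical two-state, two-jump-value example with q = 3.
law, q0 = limit_parameters([0.5, 0.5], 3.0, [[0.2, 0.3], [0.4, 0.1]])
```

In this example the averaged weights are 0.3 and 0.2, so the limit jump law is (0.6, 0.4) and the limit intensity is 1.5.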

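Example. Let us assume that the ergodic process $x(t)$, $t \ge 0$, takes values
in $E = \{1,2\}$, and has generator matrix

$$Q = \begin{pmatrix} -\lambda & \lambda \\ \mu & -\mu \end{pmatrix}.$$

The transition matrix of the embedded Markov chain is

$$P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$

Thus, the stationary distributions of $(x(t))$ and $(x_n)$ are respectively:

$$\pi = \Bigl(\frac{\mu}{\lambda+\mu},\ \frac{\lambda}{\lambda+\mu}\Bigr), \quad \rho = \Bigl(\frac{1}{2},\ \frac{1}{2}\Bigr).$$

Now, suppose that, for each $\varepsilon > 0$, the random variables $\alpha^{\varepsilon}_k(x)$, $x = 1,2$,
$k \ge 1$, take values in $\{\varepsilon a_0, a_1\}$ with probabilities depending on the state
$x$, i.e., $\Phi^{\varepsilon}_x(\varepsilon a_0) = \mathbb{P}(\alpha^{\varepsilon}_k = \varepsilon a_0) = 1 - \varepsilon p_x$ and
$\Phi^{\varepsilon}_x(a_1) = \mathbb{P}(\alpha^{\varepsilon}_k = a_1) = \varepsilon p_x$, for $x \in E$.

We have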

g{u)$ e x {du) = e[^(oi)p x + 9 e g {x)}, 



I' 



where ^(a;) := eaQg(eaa) / s 2 Oq = ^af, ' °(1) = °( £ )> fore ~ * 0> an d 



/' 



t**§(rfti) = e[{ao + aiP*) + e (aO], 

where s (x) = —eaop x . 

For the limit process, we have P(a° = ai) = 1, thus 

i4°(<) = gao« + oii/°(t), 

with Ei/°(i) = g <, 9 = A + /i, # = <?Po = <?(Pi + P2V2. 

Let us now take: $\lambda = \mu = 0.01$; $p_1 = 0.5$; $p_2 = 0.6$; $a_1 = 100$;
$a_0 = -2$; $\varepsilon = 0.1$. Then we get $q_0 = 0.0165$, and Figure 1 gives two
trajectories in the time interval $[0, 4500]$, one for the initial process and the other
for the limit process.

13.4 Increment Process in an Asymptotic Split Phase Space

The switching Markov process $x^{\varepsilon}(t)$, $t \ge 0$, is here considered in the series
scheme with a small series parameter $\varepsilon > 0$, on an asymptotic split phase
space:

$$E = \bigcup_{v \in V} E_v, \quad E_v \cap E_{v'} = \emptyset, \quad v \ne v', \tag{4.1}$$

where $(V, \mathcal{V})$ is a compact measurable space. The case where $V$ is a finite set
is of particular interest in applications.
The generator is given by the relation

$$Q^{\varepsilon}\varphi(x) = \int_E Q^{\varepsilon}(x,dy)\,[\varphi(y) - \varphi(x)]. \tag{4.2}$$

The transition kernel $Q^{\varepsilon}$ has the following representation

$$Q^{\varepsilon}(x,B) = q(x)\,P^{\varepsilon}(x,B) = Q(x,B) + \varepsilon\,Q_1(x,B), \tag{4.3}$$

with the stochastic kernel $P^{\varepsilon}$ representation

$$P^{\varepsilon}(x,B) = P(x,B) + \varepsilon\,P_1(x,B). \tag{4.4}$$

Figure 1. Trajectories of initial and limit processes, and of the drift.

The stochastic kernel $P(x,B)$ is linked with the split phase space (4.1) as
follows

$$P(x,E_v) = 1_v(x) := \begin{cases} 1, & x \in E_v, \\ 0, & x \notin E_v. \end{cases} \tag{4.5}$$

In the sequel we suppose that the kernel $P_1$ is of bounded variation, i.e.,

$$|P_1|(x,E) < +\infty. \tag{4.6}$$

According to (4.4) and (4.5), the Markov process $x^{\varepsilon}(t)$, $t \ge 0$, spends a
long time in every class $E_v$ and the probability of transition from one class to
another is $O(\varepsilon)$.

The phase merging scheme [Korolyuk, 1999] is realized under the condition
that the support Markov process $x^0(t)$, $t \ge 0$, defined by the kernel
$Q(x,dy) = q(x)\,P(x,dy)$ is uniformly ergodic in every class $E_v$, $v \in V$, with
the stationary distributions

$$\pi_v(dx)\,q(x) = q_v\,\rho_v(dx), \quad q_v := \int_{E_v} \pi_v(dx)\,q(x). \tag{4.7}$$

Let us define the merged function

$$v(x) = v, \quad x \in E_v. \tag{4.8}$$

By the phase merging scheme [Korolyuk, 1999], the merged Markov process
converges weakly

$$v(x^{\varepsilon}(t/\varepsilon)) \Rightarrow \hat{x}(t), \quad t \ge 0, \quad \text{as } \varepsilon \to 0, \tag{4.9}$$

to the merged Markov process $\hat{x}(t)$, $t \ge 0$, defined on the merged phase space
$V$ by the generating kernel

$$\hat{Q}(v,\Gamma) = \int_{E_v} \pi_v(dx)\,Q_1(x,E_\Gamma), \quad E_\Gamma = \bigcup_{v \in \Gamma} E_v, \quad \Gamma \in \mathcal{V}. \tag{4.10}$$

The counting process of jumps, denoted $\hat{\nu}(t)$, can be obtained as the following
limit [Korolyuk, 1995]

$$\varepsilon\,\nu^{\varepsilon}(t/\varepsilon) \Rightarrow \hat{\nu}(t), \quad \text{as } \varepsilon \to 0,\ t \ge 0.$$

THEOREM 2 Under the Assumptions A2-A5, in the phase merging scheme
the increment process with Markov switching in series scheme

$$\zeta^{\varepsilon}(t) := \sum_{k=1}^{\nu^{\varepsilon}(t/\varepsilon)} \alpha^{\varepsilon}_k(x^{\varepsilon}_k), \quad t \ge 0, \tag{4.11}$$

converges weakly to the additive semimartingale $\zeta^0(t)$, $t \ge 0$, which is defined
by its predictable characteristics,

$$B(t) = \int_0^t b(\hat{x}(s))\,ds, \quad b(v) = q_v\,a(v), \quad a(v) := \int_{E_v} \rho_v(dx)\,a(x); \tag{4.12}$$

the modified second characteristic $C^{\varepsilon}$ converges to

$$C(t) = \int_0^t C(\hat{x}(s))\,ds,$$

where

$$C(v) = q_v \int_{E_v} \rho_v(dx)\,C_0(x), \quad v \in V, \quad \text{and} \quad C_0(x) = \int_{\mathbb{R}} u^2\,\Phi_x(du). \tag{4.13}$$

And the predictable measure is

$$\nu_t(g) = \int_0^t \hat{\Phi}_{\hat{x}(s)}(g)\,ds, \quad \hat{\Phi}_v(g) = q_v\,\Phi_v(g), \tag{4.14}$$

where

$$\Phi_v(g) := \int_{E_v} \rho_v(dx)\,\Phi_x(g).$$

The semimartingale $\zeta^0(t)$, $t \ge 0$, with predictable characteristics (4.12) and
(4.14), can be represented in the following form

$$\zeta^0(t) = \int_0^t A^0(ds; \hat{x}(s)), \tag{4.15}$$

or, in the equivalent increment form

Ht) 

cow = E<-!^) + A h } m))- (4.i6) 

The compound Poisson processes $A_v^0(t)$ are defined by the generators

$$A(v)\varphi(u) = q_v^0 \int_{\mathbb{R}} [\varphi(u + z) - \varphi(u)]\,\Phi_v^0(dz),$$

and $\nu_v^0(t)$ are the counting Poisson processes characterized by the intensity
$q_v^0 = q_v \Phi_v(1)$. It is also defined explicitly by

$$A_v^0(t) = \sum_{l=1}^{\nu_v^0(t)} \alpha^0_{vl} + t\,q_v a_v^0, \quad v \in V,$$

for fixed $v \in V$, where $\alpha^0_{vl}$, $l \ge 1$, are i.i.d. r.v.s with common distribution
function defined by the measure

$$\Phi_v^0(g) = \Phi_v(g)/\Phi_v(1).$$

The drift parameter is given by

$$a_v^0 = a(v) - \Phi_v(1)\,E\alpha^0_{v1}.$$

In applications, the limit semimartingale (4.15) can be considered in the 
following form 

Ht) 

Co(*) = Yl M&fc-i) + 7(*)c(x(t)) + Mt), (4.17) 

k=i 

where $\mu^0(t)$ is a martingale fluctuation. The predictable term in (4.17) is a
linear deterministic drift between jumps of the merged switching Markov process
$x(t)$, $t \ge 0$.

13.5 Continuous Additive Functional

We consider an additive jump functional with semi-Markov switching in a
Poisson approximation scheme depending on the small series parameter $\varepsilon > 0$,
namely

$$\xi^\varepsilon(t) = \int_0^t \eta^\varepsilon(ds; x(s/\varepsilon)), \qquad (5.1)$$

where $\eta^\varepsilon(t; x)$, $t \ge 0$, $x \in E$, $\varepsilon > 0$, is a family of Markov jump processes in
the series scheme defined by the generators

$$\Gamma^\varepsilon(x)\varphi(u) = \varepsilon^{-1} \int_{\mathbb{R}} [\varphi(u + v) - \varphi(u)]\,\Gamma^\varepsilon(dv; x), \quad x \in E.$$



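Assumptions (C)

(C1:) The switching semi-Markov process $x(t)$, $t \ge 0$, is uniformly ergodic
with the stationary distribution

$$\pi(dx) = \rho(dx)\,m(x)/m, \qquad (5.2)$$

$$m(x) \equiv m_x = \int_0^\infty \bar G_x(t)\,dt, \qquad m := \int_E \rho(dx)\,m(x), \qquad (5.3)$$

$$\rho(B) = \int_E \rho(dx)\,P(x, B), \qquad \rho(E) = 1. \qquad (5.4)$$

Here $\bar G_x(t) = 1 - G_x(t)$ denotes the survival function of the sojourn time in state $x$.

(C2:) Approximation of the mean jump

$$a^\varepsilon(x) = \int_{\mathbb{R}} v\,\Gamma^\varepsilon(dv; x) = \varepsilon[a(x) + \theta^\varepsilon(x)] \qquad (5.5)$$

and $a(x)$ is bounded, i.e., $|a(x)| \le a < +\infty$.

(C3:) Poisson approximation condition

$$\Gamma^\varepsilon_g(x) = \int_{\mathbb{R}} g(v)\,\Gamma^\varepsilon(dv; x) = \varepsilon[\Gamma_g(x) + \theta^\varepsilon_g(x)] \qquad (5.6)$$

for all $g \in C_3(\mathbb{R})$, and the kernel $\Gamma_g(x)$ is bounded for all $g \in C_3(\mathbb{R})$,

$$|\Gamma_g(x)| \le \Gamma_g.$$

The negligible terms in (5.5) and (5.6) satisfy the conditions

$$\sup_{x \in E} |\theta^\varepsilon(x)| \to 0 \quad \text{and} \quad \sup_{x \in E} |\theta^\varepsilon_g(x)| \to 0, \quad \text{as } \varepsilon \to 0, \qquad (5.7)$$

for all $g \in C_3(\mathbb{R})$.

(C4:) Uniform square-integrability

$$\lim_{c \to \infty} \sup_{x \in E} \int_{|v| > c} v v^*\,\Gamma(dv; x) = 0, \qquad (5.8)$$

where the kernel $\Gamma(dv; x)$ is defined on the measure-determining class
$C_3(\mathbb{R})$ by the relation

$$\Gamma_g(x) = \int_{\mathbb{R}} g(v)\,\Gamma(dv; x), \quad g \in C_3(\mathbb{R}). \qquad (5.9)$$

(C5:) Cramér's condition

$$\int_0^\infty e^{hs}\,\bar G_x(s)\,ds \le H < \infty. \qquad (5.10)$$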

THEOREM 3 Under Assumptions C1-C5, the additive functional (5.1) converges
weakly to the CPPD $\xi^0(t)$, $t \ge 0$, defined by the generator

$$\Gamma\varphi(u) = a\,\varphi'(u) + \int_{\mathbb{R}} [\varphi(u + v) - \varphi(u) - v\varphi'(u)]\,\Gamma(dv), \qquad (5.11)$$

where

$$a = \int_E \pi(dx)\,a(x), \qquad (5.12)$$

and

$$\Gamma(dv) = \int_E \pi(dx)\,\Gamma(dv; x). \qquad (5.13)$$

The additive jump functional (5.1) in the Poisson approximation scheme 
can be considered with the semi-Markov switching in the split state space (see 
Section 4, Theorem 2). 

Due to both the representation (5.11)-(5.13) of the limit generator and the
approximation conditions C2 and C3, the small jumps of the initial functional
are transformed into the deterministic drift $U^0(t) = a_0 t$,

$$a_0 = a - b, \qquad b := \int_{\mathbb{R}} v\,\Gamma(dv). \qquad (5.14)$$

The big jumps of the initial functional (5.1) are distributed following the
averaged distribution function

$$F(dv) := \Gamma(dv)/\Gamma(\mathbb{R}), \qquad (5.15)$$

with the intensity of jump moments $\gamma := \Gamma(\mathbb{R})$. The limit Markov process has
the representation $\xi^0(t) = U^0(t) + \zeta^0(t)$, where the Markov process $\zeta^0(t)$ has
the following generator

$$\Gamma_0\varphi(u) = \gamma \int_{\mathbb{R}} [\varphi(u + v) - \varphi(u)]\,F(dv).$$
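
The decomposition of the limit process into a deterministic drift and a compound Poisson part can be sketched numerically. This is a minimal sketch under stated assumptions, not the authors' construction: the switching process has finitely many states with stationary law `pi`, and each jump kernel is concentrated on a few atoms. The names `limit_parameters` and `simulate_limit` and all numerical values are hypothetical.

```python
import numpy as np

def limit_parameters(pi, a_x, atoms, gamma_x):
    """Average the local characteristics over the stationary law pi.

    pi      : (n,) stationary probabilities of the switching process
    a_x     : (n,) mean-jump coefficients a(x)
    atoms   : (k,) possible jump sizes v
    gamma_x : (n, k) masses of the kernel Gamma({v}; x)
    """
    a = pi @ a_x                   # averaged drift coefficient, as in (5.12)
    gamma_dv = pi @ gamma_x        # averaged jump measure on the atoms, as in (5.13)
    b = atoms @ gamma_dv           # b = integral of v against the averaged measure
    a0 = a - b                     # drift of the deterministic part, as in (5.14)
    intensity = gamma_dv.sum()     # total mass of the averaged measure
    f = gamma_dv / intensity       # normalised jump distribution, as in (5.15)
    return a0, intensity, f

def simulate_limit(t, a0, intensity, atoms, f, rng):
    """One sample of drift plus compound Poisson part at time t."""
    n_jumps = rng.poisson(intensity * t)
    jumps = rng.choice(atoms, size=n_jumps, p=f)
    return a0 * t + jumps.sum()

# Hypothetical two-state example.
pi = np.array([0.5, 0.5])
a_x = np.array([1.0, 3.0])
atoms = np.array([1.0, 2.0])
gamma_x = np.array([[0.4, 0.2],
                    [0.6, 0.0]])
a0, intensity, f = limit_parameters(pi, a_x, atoms, gamma_x)

rng = np.random.default_rng(0)
samples = np.array([simulate_limit(10.0, a0, intensity, atoms, f, rng)
                    for _ in range(2000)])
print(a0, intensity, f, samples.mean())
```

Since the mean jump size times the intensity equals $b$, the expected value of the simulated process at time $t$ is $a\,t$; the sample mean printed above should therefore be close to $2 \cdot 10 = 20$ in this example.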

13.6 Scheme of Proofs 

Let us give here the main steps of the proofs in the case of the continuous
additive functional $\xi^\varepsilon(t)$ given in (2.3).

The weak convergence for additive functionals with semi-Markov switching
is considered here as in our previous paper [Korolyuk, 2002] in the setting
of the books by Jacod & Shiryaev [Jacod, 1987] and Ethier & Kurtz [Ethier,
1986].

The semi-Markov switching requires a new approach based on the compensative
operator of the Markov renewal process, see [Sviridenko, 1989]. The
additive jump functional (5.1) is first considered as an additive semimartingale
defined by its predictable characteristics [Jacod, 1987; Liptser, 1989; Liptser,
1991; Borovskikh, 1997; Çinlar, 1980].

The main steps of the proofs include: the construction of the predictable
characteristics of the semimartingale $\xi^\varepsilon(t)$, the construction of the compensative
operator of the extended Markov renewal process, the convergence of the predictable
characteristics, and the identification of the limit process.

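LEMMA 1 Under the assumptions of Theorem 3, the predictable characteristics
$(B^\varepsilon(t), C^\varepsilon(t), \Gamma^\varepsilon_g(t))$ (see [Jacod, 1987], Theorem VI.3.31) of the
semimartingale

$$\xi^\varepsilon(t) = \int_0^t \eta^\varepsilon(ds; x(s/\varepsilon)) \qquad (6.1)$$

are defined by the following relations:

$$B^\varepsilon(t) = \varepsilon\Big[\int_0^{t/\varepsilon} a(x(s))\,ds + \theta^\varepsilon_b(t)\Big], \quad t \ge 0. \qquad (6.2)$$

The modified second characteristic is

$$C^\varepsilon(t) = \varepsilon\Big[\int_0^{t/\varepsilon} C(x(s))\,ds + \theta^\varepsilon_c(t)\Big], \quad t \ge 0. \qquad (6.3)$$

The predictable measure is

$$\Gamma^\varepsilon_g(t) = \varepsilon\Big[\int_0^{t/\varepsilon} \Gamma_g(x(s))\,ds + \theta^\varepsilon_g(t)\Big], \quad t \ge 0, \qquad (6.4)$$

where $C(x) = \int_{\mathbb{R}^d} v v^*\,\Gamma(dv; x)$, and $v^*$ means the transpose of the vector $v$.

Note that $\theta^\varepsilon_b$, $\theta^\varepsilon_c$, and $\theta^\varepsilon_g$ satisfy the negligible condition (see [Jacod, 1987],
Lemma VI.3.31),

$$\sup_{0 \le t \le T} |\theta^\varepsilon_\cdot(t)| \to 0, \quad \text{as } \varepsilon \to 0, \quad \text{for all } T > 0.$$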

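In what follows, it is sufficient to study only the convergence of
$(B_0^\varepsilon(t), C_0^\varepsilon(t), \Gamma_0^\varepsilon(t; g))$, where

$$B_0^\varepsilon(t) = \int_0^t a(x(s/\varepsilon))\,ds, \quad t \ge 0, \qquad (6.5)$$

$$C_0^\varepsilon(t) = \int_0^t C(x(s/\varepsilon))\,ds, \quad t \ge 0, \qquad (6.6)$$

$$\Gamma_0^\varepsilon(t; g) = \int_0^t \Gamma_g(x(s/\varepsilon))\,ds, \quad t \ge 0. \qquad (6.7)$$

In the sequel the process $A^\varepsilon(t)$ will denote one of the above predictable
characteristics $B_0^\varepsilon(t)$, $C_0^\varepsilon(t)$, $\Gamma_0^\varepsilon(t)$.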

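The following auxiliary processes will be used:

$$\nu^\varepsilon(t) := \max\{n : \tau_n^\varepsilon \le t\} = \max\{n : \tau_n \le t/\varepsilon\},$$

$$\theta_-^\varepsilon(t) = t - \tau^\varepsilon(t), \qquad \theta_+^\varepsilon(t) = \tau_+^\varepsilon(t) - t.$$

The extended Markov renewal process is considered as a three-component
Markov chain

$$A_n^\varepsilon = A^\varepsilon(\tau_n^\varepsilon), \quad x_n^\varepsilon, \quad \tau_n^\varepsilon, \quad n \ge 0, \qquad (6.8)$$

where $x_n^\varepsilon = x^\varepsilon(\tau_n^\varepsilon)$, $x^\varepsilon(t) := x(t/\varepsilon)$ and $\tau_{n+1}^\varepsilon = \tau_n^\varepsilon + \varepsilon\theta_{n+1}^\varepsilon$, $n \ge 0$, and

$$P(\theta_{n+1}^\varepsilon \le t \mid x_n^\varepsilon = x) = G_x(t) = P(\theta_x \le t). \qquad (6.9)$$

We are using here the notion of compensative operator introduced by Wentzel
& Sviridenko (see [Sviridenko, 1989]).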

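DEFINITION 1 ([Sviridenko, 1989]) The compensative operator $\mathbb{L}^\varepsilon$ of the
extended Markov renewal process (6.8) is defined by the following relation

$$\mathbb{L}^\varepsilon\varphi(u, x, t) = \{E[\varphi(A_1^\varepsilon, x_1^\varepsilon, \tau_1^\varepsilon) - \varphi(u, x, t) \mid \mathcal{F}_t^\varepsilon]\}/\varepsilon m(x), \qquad (6.10)$$

where $m(x) = E\theta_x = \int_0^\infty \bar G_x(t)\,dt$, with $\bar G_x = 1 - G_x$, and

$$\mathcal{F}_t^\varepsilon := \sigma(A^\varepsilon(s), x^\varepsilon(s), \tau^\varepsilon(s);\ 0 \le s \le t). \qquad (6.11)$$

Let $A_t(x)$, $t \ge 0$, $x \in E$, be a family of semigroups determined by the
generators

$$A(x)\varphi(u) = a(x)\varphi'(u). \qquad (6.12)$$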

LEMMA 2 The compensative operator (6.10) of the extended Markov renewal
process (6.8) can be defined by the relation

$$\mathbb{L}^\varepsilon\varphi(u, x, t) = \Big[\int_0^\infty G_x(ds)\,A_{\varepsilon s}(x) \int_E P(x, dy)\,\varphi(u, y, t + \varepsilon s) - \varphi(u, x, t)\Big] \Big/ \varepsilon m(x). \qquad (6.13)$$

The proof of Lemma 2 follows directly from Definition 1.

LEMMA 3 The extended Markov renewal process (6.8) is characterized by 
the martingale 

n 

fc=0 
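In what follows the martingale property will be used for the process

$$\zeta^\varepsilon(t) = \varphi(A^\varepsilon(\tau_+^\varepsilon(t)), x^\varepsilon(\tau_+^\varepsilon(t)), \tau_+^\varepsilon(t)) - \int_0^t \mathbb{L}^\varepsilon\varphi(A^\varepsilon(\tau^\varepsilon(s)), x^\varepsilon(s), \tau^\varepsilon(s))\,ds, \qquad (6.15)$$

where $\tau_+^\varepsilon(t) := \tau^\varepsilon_{\nu_+^\varepsilon(t)}$, $\nu_+^\varepsilon(t) := \nu^\varepsilon(t) + 1$.
Note that the following relations hold:

$$\zeta^\varepsilon(\tau_n^\varepsilon) = \mu_{n+1}^\varepsilon, \quad n \ge 0, \qquad (6.16)$$

and

$$\zeta^\varepsilon(t) = \zeta^\varepsilon(\tau^\varepsilon(t)), \quad \text{for } \tau^\varepsilon(t) \le t < \tau_+^\varepsilon(t). \qquad (6.17)$$

The random numbers $\nu_+^\varepsilon(t) = \nu^\varepsilon(t) + 1$ are Markov moments for

$$\mathcal{F}_n^\varepsilon = \sigma(A_k^\varepsilon, x_k^\varepsilon, \tau_k^\varepsilon;\ 0 \le k \le n).$$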

LEMMA 4 The process (6.17) has the martingale property

$$E[\zeta^\varepsilon(t) - \zeta^\varepsilon(s) \mid \mathcal{F}_s^\varepsilon] = 0, \quad \text{for } 0 \le s \le t \le T. \qquad (6.18)$$

Note that the process $\zeta^\varepsilon(t)$, $t \ge 0$, is not a martingale since it is not
$\mathcal{F}_t^\varepsilon$-adapted. The next lemma is basic in the proof of the compact containment
condition for the additive functionals $A^\varepsilon(t)$, $t \ge 0$ (compare with Lemma
3.2 of [Ethier, 1986]).

LEMMA 5 The process

£{t) = e-*W<p{A^t)),*Wt)),7i{t)) 

+ / [ e - c v(^«( s )),x e «( s )),<( s )) 

Jo 

-e- CTe ^I, e ip(A £ {r e (s)),x e (s),T £ (s))]ds (6.19) 

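has the martingale property for every $c \in \mathbb{R}$, i.e.,

$$E[\zeta_c^\varepsilon(t) - \zeta_c^\varepsilon(s) \mid \mathcal{F}_s^\varepsilon] = 0, \quad \text{for } 0 \le s \le t \le T. \qquad (6.20)$$

The algorithm of Poisson approximation given in Theorem 1 provides the
asymptotic representation of the compensative operator.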

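LEMMA 6 The compensative operator (6.13) applied to a function
$\varphi \in C^2(\mathbb{R}) \times B(E)$ has the asymptotic representation

$$\mathbb{L}^\varepsilon\varphi(u, x) = \varepsilon^{-1} Q\varphi(u, x) + A(x)P\varphi(u, x) + \varepsilon\theta^\varepsilon(x)P\varphi(u, x), \qquad (6.21)$$

where

$$Q\varphi(\cdot, x) = q(x) \int_E P(x, dy)[\varphi(\cdot, y) - \varphi(\cdot, x)], \qquad q(x) := 1/m(x), \qquad (6.22)$$

$$A(x)\varphi(u, \cdot) = a(x)\varphi'_u(u, \cdot). \qquad (6.23)$$

And the negligible operator is defined as follows

$$\theta^\varepsilon(x)\varphi(u, \cdot) = A^2(x)A^\varepsilon(x)\varphi(u, \cdot), \qquad (6.24)$$

where

$$A^\varepsilon(x) = \int_0^\infty A_{\varepsilon s}(x)\,\bar G_x^{(2)}(s)\,ds, \qquad (6.25)$$

$$\bar G_x^{(2)}(s) := \int_s^\infty \bar G_x(t)\,dt. \qquad (6.26)$$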

Note that the remaining term in (6.21) is computed by using the relation 

A £ {x)= f G x (ds)A £ (x)= f°°A £S (x)G x 2) (s)ds. (6.27) 

Jo Jo 

LEMMA 7 A solution of the singular perturbation problem

$$\mathbb{L}^\varepsilon[\varphi(u) + \varepsilon\varphi_1(u, x)] = L\varphi(u) + \varepsilon\theta_0^\varepsilon(x)\varphi(u) \qquad (6.28)$$

is given by the generator

$$L\varphi(u) = a\,\varphi'(u). \qquad (6.29)$$

The negligible term in (6.28) is represented as follows

$$\theta_0^\varepsilon(x)\varphi(u) = [(A(x) + \varepsilon\theta^\varepsilon(x))R_0 A(x) + \varepsilon\theta^\varepsilon(x)]\varphi(u). \qquad (6.30)$$

The following compact containment condition together with the submartingale
condition (see [Korolyuk, 2000]) provides the compactness of the family
$(A^\varepsilon(t),\ t \ge 0,\ \varepsilon > 0)$.

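LEMMA 8 The family of processes $(A^\varepsilon(t),\ t \ge 0,\ 0 < \varepsilon \le \varepsilon_0)$ with
bounded initial value $E|A^\varepsilon(0)| \le b < +\infty$ satisfies the compact containment
condition (see [Ethier, 1986])

$$\lim_{l \to \infty} \sup_{0 < \varepsilon \le \varepsilon_0} P\Big(\sup_{0 \le t \le T} |A^\varepsilon(t)| > l\Big) = 0. \qquad (6.31)$$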

The completion of the proof of the theorem is realized by the scheme described
in our previous paper [Korolyuk, 2000], by using Theorem VIII.2.18 in Jacod
& Shiryaev [Jacod, 1987].

Abstract In this article we review the problem of discretization-regularization for inverse
linear ill-posed problems from a statistical point of view. We discuss the problem
in the context of adaptive model selection and relate these results to Bayesian
estimation.

Keywords: Model selection, penalized estimation, Rosenthal type inequalities, ill-posed 
problems. 

14.1 Introduction 

In many situations we need to estimate a certain function $f \in H$, a given
Hilbert space, based on indirect observations $y_i = (Af)(x_i) + \eta_i$, $i = 1, \ldots, n$,
when $A$ is an ill-posed operator, that is, when $A$ does not have an inverse or
when its inverse is not continuous.

Here $\eta_i$ is assumed to be a zero-mean i.i.d. sequence of generally unbounded
random variables which accounts for a perturbation of the true value
$A(f)(x_i)$, and $x_i$, $i = 1, \ldots, n$, is assumed to be a fixed set of observation
points.

As $A$ is ill posed, searching for the solution $f$ based directly on the noise-corrupted
observations $y_i$, $i = 1, \ldots, n$, is useless. It is usual to look instead at solutions
that not only fit the observations but are also regular, as defined by a
given functional $J(f)$. Thus we search for the solution $\hat f$ of the minimization
problem

$$\min_f \big(\|y - A(f)(x)\|^2_{(n)} + \lambda J(f)\big) \qquad (1.1)$$

where $\|y - A(f)(x)\|^2_{(n)} = \frac{1}{n}\sum_{i=1}^n (y_i - A(f)(x_i))^2$, and $\lambda$ is a regularization parameter
which must be chosen according to some criterion. Typically, methods such as
the L-curve method [Engl & Grever, 1994] or Morozov's discrepancy principle
(see for example [Frommer & Maass, 1999]) in the least squares case,
or statistical methods such as PMSE (predictive mean square error) or
cross-validation are used (for example see [O'Sullivan, 1996]). Equation (1.1) also
includes most estimation schemes based on entropy methods [Gamboa & Gassiat,
1997], [Gamboa, 1999].
In practice, however, (1.1) is hardly ever considered directly. Indeed, solutions are
sought in finite-dimensional closed subspaces $S_m$ ($\dim(S_m) = d_m$) of $H$.
Normally, the restricted problem is also ill conditioned and must be regularized.
This yields a sequence of closed subspaces $S_m$ indexed by $m \in M_n$, a
collection of index sets, and a sequence of regularization parameters $\lambda_m$. An
important problem is thus how to choose a "correct" subspace $S_m$ based on the
data and how to interpret the sequence of $\lambda_m$ in such a choice.
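
As a concrete illustration of a discretized version of problem (1.1), the sketch below solves the penalized least squares problem in coefficient space with the quadratic functional $J(f) = \|f\|^2$ (Tikhonov regularization). Everything specific here is an assumption made for the example: the operator is a discretized integration operator, the true function is a sine, and the function name `tikhonov` is hypothetical.

```python
import numpy as np

def tikhonov(a_mat, y, lam):
    """Minimise (1/n) * ||y - A c||^2 + lam * ||c||^2 via the normal equations."""
    n, d = a_mat.shape
    lhs = a_mat.T @ a_mat / n + lam * np.eye(d)
    return np.linalg.solve(lhs, a_mat.T @ y / n)

rng = np.random.default_rng(0)
n = d = 50
x = (np.arange(n) + 0.5) / n
# Hypothetical ill-posed operator: discretised integration (Af)(x) = int_0^x f(t) dt.
a_mat = np.tril(np.ones((n, d))) / d
f_true = np.sin(2 * np.pi * x)
y = a_mat @ f_true + 0.01 * rng.standard_normal(n)

c_small = tikhonov(a_mat, y, 1e-8)   # almost no regularization: noise is amplified
c_large = tikhonov(a_mat, y, 1e-3)   # stronger regularization: smoother, more biased
print(np.linalg.norm(c_small - f_true), np.linalg.norm(c_large - f_true))
```

The norm of the regularized solution decreases as $\lambda$ grows, which is the sense in which $\lambda$ trades fidelity to the data against regularity.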

We shall refer to the penalized model selection framework developed in a series
of works by Birgé and Massart [Barron, Birge & Massart, 1999] (see also
[Birge & Massart 2001], [Birge & Massart, 1998], [Massart, 2000]), based on
the idea of sieves due originally to Grenander [Grenander, 1981]. Related ideas
are also developed by Vapnik [Vapnik, 1998] in his Structural Risk Minimization
setting. This is a statistical point of view, and the choice of solution is compared
to optimal rate estimation over certain classes of functions.

Basically, the idea is to penalize high-dimensional spaces. Intuitively,
estimation will be better if $d_m$ is large, but then $A$ will be harder to control (this
will be true even if the operator is not ill posed). Penalization should be chosen
in such a way as to obtain almost optimal results. That is, the chosen solution
should be the (almost) best among all possible choices of subspaces $S_m$, for
$m \in M_n$.

Many authors have addressed the problem of simultaneous discretization- 
projection and regularization for ill posed problems (see for example [Kilmer 
& O'Leary, 2001], [Maass et al, 2001], [Neubauer, 1998], [Solodky, 1999]). If 
regularization is done by projection (truncated S.V.D.), the problem is essen- 
tially that of determining a "good" subspace. This can be done by selecting 
a cutoff point or by threshold methods. As will be seen this amounts to an 
appropriate selection of a penalization term for the dimension. Choosing the 
right subspace will be called model selection. 

If regularization is done by Tikhonov's method [Tikhonov & Arsenin, 1998], a
problem cited by many authors is whether the appropriate regularization parameter
for the projected solution is also appropriate for the non-projected one. From
a model selection point of view, the problem is stated as minimizing a certain
contrast function over a certain parameter space for each $d_m$-dimensional
subspace and then choosing the best subspace, corrected by the appropriate
penalization term.

The main goal of this article is to present simultaneous discretization-regularization
in an adaptive fashion, based on the ideas of model selection, and to
interpret solutions from a Bayesian point of view, considering a prior distribution
over the family of models and a suitable prior over all possible solutions in a
given model.

We start with a short review of the main ideas of model selection, as developed
by Birgé and Massart. We then describe minimax estimation bounds for
ill-posed problems, and finally apply model selection techniques to ill-posed
problems.

In Section 14.5 penalized minimum contrast estimation is presented in a
Bayesian framework.

In the last Section we discuss a different choice of the regularization functional,
namely $J(\cdot)$ such that $J(f) = \|f\|_1$, as proposed by Aluffi-Pentini et al.
[Aluffi-Pentini et al, 1999]. It can be seen that this estimator, for an $L_2$ loss
function, is actually soft thresholding [Kaliffa & Mallat, 2001].

14.2 Penalized model selection [Barron, Birge & Massart, 
1999] 

Consider the direct problem of estimating a function $f \in L^2(T, \mu)$ based on
observations

$$y_i = f(x_i) + \eta_i, \quad i = 1, \ldots, n,$$

where as before we assume $x_i$, $i = 1, \ldots, n$, to be a fixed design (actually we
could consider the more general problem of the white noise framework). Although
not specified, usually we associate the above problem with an orthonormal
basis $\{\phi_j\}_{j \in \mathbb{Z}}$. The problem is then analogous to selecting the correct
parameters of the function $f$ over this basis. This is usually done by minimizing a certain
discrepancy functional $l$ over the $n$ observations. The function $l$ is called the loss
function and $\gamma_n(f) = \frac{1}{n}\sum_{i=1}^n l(y_i - f(x_i))$ is called the empirical risk function,
so actually the idea is to find the $f$ that minimizes the empirical risk. If we
assume $f(x) = \sum_{j \in m} f_j \phi_j(x)$, the above strategy leads to $\hat f = \sum_{j \in m} \hat f_j \phi_j(x)$,
with $\hat f_j = \frac{1}{n}\sum_{i=1}^n y_i \phi_j(x_i)$. If $m$ is known beforehand then $E\|f - \hat f\|^2 = \frac{d_m}{n}\sigma^2$,
where $\sigma^2 = \mathrm{var}(\eta_i)$, for each $i = 1, \ldots, n$. What happens if we do not know
$m$? One possibility is estimating $\hat f_m$ for $m$ in a certain subset and comparing
$E\|f - \hat f_m\|^2 = \|f - \Pi_m f\|^2 + \frac{d_m}{n}\sigma^2$. As $f$ is not necessarily equal to $\Pi_m(f)$, the
first term controls this error. This is the typical bias-variance decomposition.
If the function is sufficiently regular, for example if

$$\sum_j j^{2\alpha} f_j^2 \le Q,$$

then $\|f - \Pi_m f\|$ is known and it turns out that $E\|f - \hat f_m\|^2 = O((Q n^{-\alpha})^{2/(2\alpha+1)})$.
This suggests searching $m$ until this bound is obtained (if $m$ is too big the
variance term will be too big, if $m$ is too small the bias term will be too big).
However, if $\alpha$ is not known, underestimating this parameter will lead to a bad
choice of $m$ and the risk will be too big. If we overestimate $\alpha$ we will force the
risk to be too small, which is known as overfitting (we adjust to our data only).
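
The projection estimator with empirical coefficients $\hat f_j = \frac{1}{n}\sum_i y_i \phi_j(x_i)$ can be sketched as follows. This is a minimal sketch under assumptions that are not in the text: the cosine basis on $[0, 1]$ is used together with the midpoint design $x_i = (i - 1/2)/n$, for which this basis is exactly orthonormal in the empirical norm; the function names are hypothetical.

```python
import numpy as np

def cosine_basis(x, d):
    """First d functions of the cosine basis on [0, 1]: 1, sqrt(2) cos(pi j x)."""
    j = np.arange(d)
    phi = np.sqrt(2.0) * np.cos(np.pi * np.outer(x, j))
    phi[:, 0] = 1.0
    return phi

def empirical_coefficients(x, y, d):
    """hat f_j = (1/n) sum_i y_i phi_j(x_i), j = 0, ..., d-1."""
    return cosine_basis(x, d).T @ y / len(x)

n = 200
x = (np.arange(n) + 0.5) / n            # midpoint design
f_coef = np.array([1.0, 0.5, -0.25])    # hypothetical true coefficients
y_clean = cosine_basis(x, 3) @ f_coef
recovered = empirical_coefficients(x, y_clean, 3)

rng = np.random.default_rng(1)
y_noisy = y_clean + 0.3 * rng.standard_normal(n)
noisy = empirical_coefficients(x, y_noisy, 3)
print(recovered, noisy)
```

Without noise the coefficients are recovered exactly; with noise each coefficient is perturbed by an error of variance $\sigma^2/n$, so a model of dimension $d_m$ pays a variance term $d_m\sigma^2/n$, as in the bias-variance decomposition above.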

Adaptive model selection is a technique which penalizes the dimension of
the estimating subset (considering a set $\mathcal{S}_n = \{S_m \mid m \in M_n\}$, with $\dim(S_m) =
d_m$), in such a way that if we choose $\hat f_{\hat m}$ by minimizing

$$\frac{1}{n}\sum_{i=1}^n l(y_i - f(x_i)) + \mathrm{pen}(m),$$

for $l$ a certain loss function, then there exists a constant $K$ such that



E||/ - f m f < C inf ( inf ||/ - f m f +pen{m)) + -. (2.1) 
meMn f m es m n 

Usually $\mathrm{pen}(m) = \kappa L_m d_m/n$, where $\kappa > 1$ and $L_m$ is a sequence which
is incorporated in order to control the complexity of $\mathcal{S}_n$. It is chosen so that

$$\sum_{m \in M_n} e^{-L_m d_m} \le K < \infty.$$

If the number of subsets $S_m$ with equal dimension is small, i.e. if
$\sup_j \big(\frac{1}{j}\log|\{m \mid d_m = j\}|\big) \le \nu$, then it is enough to choose $L_m = L > \nu$.

If the number of subsets with equal dimension is big, for example in the
problem of complete model selection, $L_m$ must be chosen non-constant. Indeed,
following [Barron, Birge & Massart, 1999], assume we choose among all
subsets of the set $\{\phi_j\}_{j \in \{1, \ldots, N\}}$. In this case the cardinality of all models with
dimension $d_m$ is equal to

$$\binom{N}{d_m} \le (eN/d_m)^{d_m},$$

as the cited authors show in Lemma 6. This implies that a good choice is
$L_m = c(1 + \log N)$. The authors further show that in fact this choice yields
the hard threshold estimator of Donoho and Johnstone [Donoho, 1995].
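
The link between complete model selection and hard thresholding can be checked on a small example. The sketch below assumes quadratic loss and an orthonormal design, so that the empirical risk of model $m$ equals a constant minus $\sum_{j \in m} \hat f_j^2$; the penalty is taken proportional to the dimension with $L_m = 1 + \log N$ and an added noise-variance factor. The coefficient values, the constant `kappa`, and the function names are hypothetical.

```python
import itertools
import math
import numpy as np

def penalized_criterion(coef, subset, pen_per_dim):
    """-sum_{j in m} coef_j^2 + pen(m), with pen(m) = pen_per_dim * d_m."""
    return -sum(coef[j] ** 2 for j in subset) + pen_per_dim * len(subset)

def best_subset(coef, pen_per_dim):
    """Exhaustive search over all subsets of {0, ..., N-1} (small N only)."""
    indices = range(len(coef))
    candidates = (s for r in range(len(coef) + 1)
                  for s in itertools.combinations(indices, r))
    return min(candidates, key=lambda s: penalized_criterion(coef, s, pen_per_dim))

def hard_threshold(coef, pen_per_dim):
    """Keep the coefficients whose square exceeds the per-dimension penalty."""
    return tuple(j for j, c in enumerate(coef) if c ** 2 > pen_per_dim)

coef = np.array([2.0, -0.1, 0.8, 0.05, -1.5, 0.3])   # hypothetical empirical coefficients
sigma2, n, kappa = 1.0, 100, 2.0
l_m = 1.0 + math.log(len(coef))
pen_per_dim = kappa * sigma2 * l_m / n
selected = best_subset(coef, pen_per_dim)
print(selected, hard_threshold(coef, pen_per_dim))
```

Because the criterion is a sum of one term per selected index, the exhaustive search and the thresholding rule select the same model; the exhaustive search is shown only to make that equivalence visible.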

Actually, model selection allows for much more general contrast functions
$\gamma_n$ based on the empirical distribution, which may make the problem nonlinear.
Think, for example, of the correct choice of a neural network, or maximum
likelihood estimation for a non-Gaussian error distribution. The results cited in
equation (2.1) are quite general and include these cases provided the contrast
and the family of subspaces $S_m$ satisfy certain conditions (Theorem 7.1 [Barron,
Birge & Massart, 1999]).

An important issue is that equation (2.1) is non-asymptotic and is useful if
the choice of the penalization term yields optimal results, i.e. minimax estimation
rates and constants. Based on these ideas we shall see that the bounds in
(2.1) can be obtained in the ill-posed case, and compare penalized estimation
to optimal linear and minimax estimation in this case.

As we mentioned in Section 1, we refer to the fixed point design problem.
Optimal results for this problem based on adaptive model selection in the well-posed
case are given by [Baraud, 2000].

14.3 Minimax estimation for ill posed problems 

Assume $A$ to be a known linear operator $A : H_1 \to H_2$, with $H_i$ Hilbert spaces
with inner product $\langle \cdot, \cdot \rangle_{H_i}$ and norm $\|\cdot\|_{H_i}$.

Our aim is estimating $f \in H_1$ given the set of indirect observations
$(x_i, y_i)_{i=1}^n$, for a fixed point design $(x_i)_{i=1}^n$, which are assumed to follow the
model

$$y_i = Af(x_i) + \eta_i. \qquad (3.1)$$

Here $\{\eta_i\}_{i=1}^n$ is a centered i.i.d. sequence of r.v. with finite $p$-th moment
and variance $\sigma^2$. As $f \in H_1$, we approximate $f(x) = \sum_{j \in m} f_j \phi_j(x)$ in
terms of some orthonormal basis $\{\phi_j\}$ of $H_1$ ($M = |m|$ can be $\infty$), where
$f_j = \langle f, \phi_j \rangle$ stand for the Fourier coefficients of $f$ with respect to the given
basis. The choice of a finite $M$ in a data-driven fashion is part of the problem
we address here.

We assume also that there exists a basis $\{\psi_j\}$ of $H_2$ such that $\langle Af, \psi_j \rangle =
f_j b_j$, with $b_j > 0$ and $b_j \to 0$. This happens if, for example, $A$ admits a
Singular Value Decomposition.

Let $M_n$ be a collection of index sets ($m \in M_n$, $m = \{j_1, \ldots, j_{d_m}\}$), and let
$(S_m)_{M_n}$ be the sequence of closed linear subspaces of $H_1$, $S_m = \mathrm{span}\{\phi_j, j \in
m\}$, with dimension $d_m < \infty$.

We also need some notation concerning the fixed point setting. For $g, h \in
H_2$, set $\langle g, h \rangle_{(n)} = \frac{1}{n}\sum_{i=1}^n g(x_i)h(x_i)$ and $\|g\|^2_{(n)} = \langle g, g \rangle_{(n)}$. Also let

$$\Sigma_m = \Big[\frac{1}{n}\sum_{i=1}^n \psi_j(x_i)\psi_k(x_i)\Big]_{(j, k \in m)}$$

be the Gram matrix associated to $\{\psi_j\}$ over $S_m$. Set $B = \mathrm{Diag}(b_k)_{k \in m}$ and
define $A_m = B\Sigma_m B$. Let

$$y_k = \frac{1}{n}\sum_{i=1}^n y_i b_k \psi_k(x_i).$$

Finally set $x = A_m^{-1}(y_j)_{j \in m}$ and

$$\zeta = A_m^{-1}\Big(\frac{1}{n}\sum_{i=1}^n \eta_i b_j \psi_j(x_i)\Big)_{j \in m}.$$

Then the estimation problem is equivalent to estimating $f$ from

$$x_j = f_j + \zeta_j. \qquad (3.2)$$

In this problem the noise is not white. If $\Sigma_m$ for all $m \in M_n$ is diagonal
(which occurs if the basis is orthogonal for the fixed design), then it will be
uncorrelated, so that if the original noise is Gaussian then $\zeta$ will be an independent
sequence. Let $\sigma_k = 1/b_k$; then $\mathrm{var}\,\zeta_j$ is proportional to $\sigma_j^2$, which tends
to infinity as $j \to \infty$. Thus the problem is transformed into a noisy problem
with dependent and growing variance noise.
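
The passage from indirect observations to the sequence model (3.2) can be sketched numerically. The following is an illustration under assumptions that are not in the text: $\{\psi_j\}$ is the cosine basis on $[0, 1]$ with the midpoint design (so that $\Sigma_m$ is the identity), and the singular values are $b_j = j^{-2}$. The function names are hypothetical.

```python
import numpy as np

def cosine_basis(x, d):
    """First d functions of the cosine basis on [0, 1]: 1, sqrt(2) cos(pi j x)."""
    j = np.arange(d)
    psi = np.sqrt(2.0) * np.cos(np.pi * np.outer(x, j))
    psi[:, 0] = 1.0
    return psi

def sequence_data(y, psi, b):
    """Compute x = A_m^{-1} (y_k) with A_m = B Sigma_m B."""
    n = psi.shape[0]
    gram = psi.T @ psi / n                    # Sigma_m
    a_m = np.diag(b) @ gram @ np.diag(b)      # A_m = B Sigma_m B
    y_k = b * (psi.T @ y) / n                 # y_k = (1/n) sum_i y_i b_k psi_k(x_i)
    return np.linalg.solve(a_m, y_k)

n, d = 200, 5
design = (np.arange(n) + 0.5) / n
psi = cosine_basis(design, d)
b = 1.0 / np.arange(1, d + 1) ** 2            # hypothetical singular values, b_j -> 0
f = np.array([1.0, -0.5, 0.3, 0.2, -0.1])     # hypothetical coefficients of f
af = psi @ (b * f)                            # (Af)(x_i) = sum_j b_j f_j psi_j(x_i)

x_clean = sequence_data(af, psi, b)           # noise-free data: x = f

rng = np.random.default_rng(3)
sigma = 0.05
x_noisy = sequence_data(af + sigma * rng.standard_normal(n), psi, b)
noise_std = sigma / (b * np.sqrt(n))          # standard deviation of zeta_j when Sigma_m = I
print(x_clean, x_noisy - f, noise_std)
```

The printed standard deviations grow like $1/b_j$, which is the growing-variance effect described above: higher coefficients are recovered with increasingly amplified noise.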

An estimator for $f|_{S_m}$ will be called linear if $\hat f = Cy$, with $C$ a given
matrix.

If $J(f)$, for $f$ restricted to $S_m$, is a quadratic functional, the resulting
estimator is linear. This relates linear estimators to quadratic regularization
functionals. In the rest of this Section we discuss efficient estimation for linear
estimators in the case $\Sigma_m$ is diagonal (for all $m \in M_n$). This means assuming
$\sup_{m \in M_n} d_m \le n$.

In this case $x_j = \sigma_j y_j$ and we will say the estimator is linear if $\hat f_j = h_j y_j$.
We have

$$R(m, h, f) = E\|\hat f - f\|^2_{H_1} = \sum_{k \in m} (1 - b_k h_k)^2 f_k^2 + \sum_{k \in m^c} f_k^2 + \frac{\sigma^2}{n}\sum_{k \in m} h_k^2. \qquad (3.3)$$

For fixed $f$, the minimum risk is attained at [Tsybakov, 2000]



This factor cannot be calculated since it depends on $f$.
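
For reference, the minimizing weights can be derived directly; the following short calculation is a sketch and is not quoted from the source. It assumes that the risk has the form $R(m, h, f) = \sum_{k \in m} (1 - b_k h_k)^2 f_k^2 + \sum_{k \in m^c} f_k^2 + \frac{\sigma^2}{n}\sum_{k \in m} h_k^2$. Each $k \in m$ contributes a convex quadratic in $h_k$, and setting its derivative to zero gives

$$-2 b_k (1 - b_k h_k) f_k^2 + 2\frac{\sigma^2}{n} h_k = 0 \quad \Longrightarrow \quad h_k = \frac{b_k f_k^2}{b_k^2 f_k^2 + \sigma^2/n}, \qquad b_k h_k = \frac{f_k^2}{f_k^2 + \sigma_k^2 \sigma^2/n},$$

with $\sigma_k = 1/b_k$. The dependence on the unknown coefficients $f_k$ is explicit in this expression.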

14.3.1. Minimax estimation over ellipsoids 

An important problem is thus giving the minimum risk over a family of
functions $f$ with prescribed regularity. The next example develops these ideas
for a specific family of functions. Set $\Theta = \Theta(a, Q)$ equal to the set of functions
$f$ such that

$$\sum_j a_j^2 f_j^2 \le Q.$$

The linear minimax risk $RL(m, Q)$ over $\Theta$ is the minimum linear risk over
the worst case in this set,

$$RL(m, Q) = \inf_h \sup_{f \in \Theta} R(m, h, f),$$

and the minimax risk over $\Theta$, considering all estimators, linear and nonlinear,
is

$$R(m, Q) = \inf_{\hat f \in S_m} \sup_{f \in \Theta} E\|\hat f - f\|^2_{H_1}.$$

An estimator that achieves the lower bounds is called a linear minimax estimator
(respectively a minimax estimator).
Let 

""a 



Qn/aZ + ZU^hT 
where $t = \max\{l : \frac{\sigma^2}{n}\sum_{j=1}^{l} \sigma_j^2 a_j (a_l - a_j) \le Q\}$.

The next result is due to Pinsker [Pinsker, 1980].

Theorem 7 Let $\{a_j\}$ be a non-decreasing sequence of non-negative numbers
such that $a_j \to \infty$ and let $b_j > 0$ for each $j = 1, \ldots$. Then the linear
minimax estimator is given by $\hat f_j = h_j^* y_j$, $h_j^* = \max(\sigma_j(1 - \kappa_n a_j), 0)$, and

T 2 

p2 



RL(m,Q) = ^^(h.f+^2ff 



n 

j em jem c 



Also, if 



maxj-< d of 

-= j- — > as a — > oo, 



then

$$R(m, Q) = RL(m, Q)(1 + o(1)).$$



This result gives minimax rates for linear estimators for this family of functions
and gives conditions under which minimax linear rates are asymptotically
the best possible rates. However, the above results depend on a known
sequence $\{a_j\}$. In general, when dealing with real data this kind of information
is not available. The problem is to develop strategies based only on the
data.

We may consider it convenient to restrict our attention to linear estimators
over a certain subset $\Lambda$. For example, when considering Pinsker weights



$$h_j = \sigma_j\Big(1 - \Big(\frac{j}{w}\Big)^\alpha\Big)_+,$$

with $w$, $\alpha$ in a certain set. Or

$$h_j = \frac{\sigma_j}{1 + (j/w)^\alpha},$$

which corresponds to the Tikhonov-Phillips weights.

In this case, in the spirit of the above results, a linear estimator $h^*$ will be
asymptotically efficient if

$$R(m, h^*, f) \le (1 + o(1)) \min_{h \in \Lambda} R(m, h, f).$$

Of course, when estimating, we do not know how to choose good weights, or
for that matter a good estimating set $m \in M_n$. Estimators are adaptive if,
based only on the data, they are able to achieve efficient rates.
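
The comparison between families of weights and the best possible (unknown) weights can be sketched numerically. The code below is an illustration under assumptions: the data are $y_k = b_k f_k + {}$noise of variance $\sigma^2/n$, all coefficients lie inside $m$, the weight families are taken in the forms $\sigma_k(1 - (k/w)^\alpha)_+$ and $\sigma_k/(1 + (k/w)^\alpha)$, and the signal and singular values are hypothetical.

```python
import numpy as np

def linear_risk(h, f, b, noise_var, n):
    """Risk of hat f_k = h_k y_k when y_k = b_k f_k + noise of variance noise_var / n."""
    return np.sum((1.0 - b * h) ** 2 * f ** 2) + noise_var / n * np.sum(h ** 2)

k = np.arange(1, 31)
b = 1.0 / k                      # hypothetical singular values
sigma_k = 1.0 / b
f = 1.0 / k ** 2                 # hypothetical smooth signal
noise_var, n = 1.0, 1000

def pinsker_type(w, alpha):
    return sigma_k * np.clip(1.0 - (k / w) ** alpha, 0.0, None)

def tikhonov_type(w, alpha):
    return sigma_k / (1.0 + (k / w) ** alpha)

# Oracle weights: coordinate-wise minimisers of the risk (they depend on f).
oracle = b * f ** 2 / (b ** 2 * f ** 2 + noise_var / n)

grid = np.linspace(1.0, 30.0, 59)
best_pinsker = min(linear_risk(pinsker_type(w, 2.0), f, b, noise_var, n) for w in grid)
best_tikhonov = min(linear_risk(tikhonov_type(w, 2.0), f, b, noise_var, n) for w in grid)
oracle_risk = linear_risk(oracle, f, b, noise_var, n)
print(best_pinsker, best_tikhonov, oracle_risk)
```

The oracle risk is a lower bound for every family of linear weights; an adaptive procedure aims to get close to the best member of the chosen family using the data only.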

14.4 Penalized model selection for ill posed linear 
problems 

To get a flavor of penalized model selection, in this Section we develop
two examples. The proofs are rather technical, so they are given in the Appendix.
These results say how penalization terms must be chosen, in terms of
the dimension of the underlying subspace. They also say that under additional
technical conditions these results are good, in the sense that they achieve optimal
rates. We stress that what we are doing is controlling complexity by means
of dimension. However, the ill-posedness, as measured by the sequence $\{b_j\}$,
must be considered in this control.

The proofs of Theorems 8 and 9 below are based on Rosenthal-type inequalities.
These results can be improved by giving exponential rates and controlling
the complexity of the spaces $S_m$ in terms of a "covering" number for the $L^2$
and $L^\infty$ norms, but we have preferred not to include this additional complication.
Our proofs follow closely those of [Baraud, 2000].

We will study two situations:

(A.) When the Gram matrix $\Sigma_m = I$ (that is, when the basis $\{\phi_j\}$ is
orthonormal for the given fixed point design). In this case $\lambda_m \in \Lambda_m$, a
given parameter space. Although the general case with a non-diagonal
Gram matrix could be dealt with in this situation, it complicates notation
and doesn't really add any further insights to the problem. This case is
studied in the simpler setting below.

(B.) When the estimation scheme corresponds to the projection estimator. In
this case, we do not require $\Sigma_m$ to be diagonal, but a certain restriction
is imposed on its eigenvalues. This kind of restriction is also found in
[Kaliffa & Mallat, 2001] when discussing almost diagonal estimation.

14.4.1. First case 

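As in the above setting, consider linear estimators defined by the expression
$\hat f_j = q_j(\lambda)(y_j/b_j)$ for $j \in m$. Of course then $\hat f(x) = \sum_{j \in m} \hat f_j \phi_j(x)$. For this
family of estimators we have, for fixed $\lambda$ and $m$, that

$$E\|\hat f - f\|^2_{H_1} = \sum_{j \in m} (1 - q_j(\lambda))^2 f_j^2 + \sum_{j \in m^c} f_j^2 + \frac{\sigma^2}{n}\sum_{j \in m} \sigma_j^2 q_j(\lambda)^2.$$

The goal is then finding $m$ and $\lambda_m$ such that the solution is optimal over a
given set of parameters $\Lambda$. We assume $\Lambda = \cup_{m \in M_n} \Lambda_m$ for a given index set
$M_n$ and that $\Lambda_m \subset \Lambda_{m'}$ if $m \subset m'$. We also assume that for each $m$, $\Lambda_m$ is a
subset of $\mathbb{R}^{d_m}$.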

For A m € A m , set ty(A) = tt^stt, j € m. 

3 3 

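As in [Tsybakov, 2000] we shall consider the following contrast, based on
the risk function:

$$\gamma_n(\lambda, m) = \sum_{j \in m} (q_j^2(\lambda) - 2q_j(\lambda))\Big((y_j\sigma_j)^2 - \frac{\sigma^2\sigma_j^2}{n}\Big) + \frac{\sigma^2}{n}\sum_{j \in m} q_j^2(\lambda)\sigma_j^2. \qquad (4.1)$$

For each fixed $\lambda$, set

$$R(\lambda, m) = \sum_{j \in m} (1 - q_j(\lambda))^2 f_j^2 + \sum_{j \in m^c} f_j^2 + \frac{\sigma^2}{n}\sum_{j \in m} \sigma_j^2 q_j(\lambda)^2.$$

We have $E(\gamma_n(\lambda, m)) = R(\lambda, m) - \|f\|^2$.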
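
The identity between the expected contrast and the risk can be verified numerically. The sketch below is an illustration under assumptions: all coefficients are taken inside $m$, the transformed data satisfy $E(y_j\sigma_j)^2 = f_j^2 + \sigma^2\sigma_j^2/n$, and the filter $q_j(\lambda) = 1/(1 + \lambda\sigma_j^2)$ is a hypothetical choice used only for the example. Since the contrast is linear in the squared data, evaluating it at the expected squares gives its expectation.

```python
import numpy as np

def risk(q, f, sigma_j, noise_var, n):
    """R(lambda, m) when every coefficient lies in m (no tail term)."""
    return np.sum((1 - q) ** 2 * f ** 2) + noise_var / n * np.sum(sigma_j ** 2 * q ** 2)

def contrast(q, x_sq, sigma_j, noise_var, n):
    """Contrast written in terms of the squared data x_j^2 = (y_j sigma_j)^2."""
    debiased = x_sq - noise_var * sigma_j ** 2 / n
    return (np.sum((q ** 2 - 2 * q) * debiased)
            + noise_var / n * np.sum(q ** 2 * sigma_j ** 2))

j = np.arange(1, 11)
sigma_j = j.astype(float)            # hypothetical sigma_j = 1 / b_j
f = 1.0 / j ** 2                     # hypothetical coefficients
noise_var, n, lam = 1.0, 500, 0.01
q = 1.0 / (1.0 + lam * sigma_j ** 2) # hypothetical filter q_j(lambda)

expected_sq = f ** 2 + noise_var * sigma_j ** 2 / n
expected_contrast = contrast(q, expected_sq, sigma_j, noise_var, n)
print(expected_contrast, risk(q, f, sigma_j, noise_var, n) - np.sum(f ** 2))
```

The two printed numbers agree, which is why minimizing the contrast over $\lambda$ and $m$ is a data-driven substitute for minimizing the unknown risk.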

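We now introduce the following penalized version of $\gamma_n$ given by

$$\gamma_n^{pen}(\lambda_m, m) = \gamma_n(\lambda, m) + \mathrm{pen}(m), \qquad (4.2)$$

where $\mathrm{pen}(m)$ will be defined below.

Set $\hat f_{\hat\lambda, \hat m} = \arg\min \gamma_n^{pen}(\lambda_m, m)$. We have the following result.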

Theorem 8

■ Assume $\eta$ is such that there exists $p > 6$ with $E|\eta|^p < \infty$.

■ Assume $\sup_{m \in M_n} \sup_{\lambda_m \in \Lambda_m} |q_j(\lambda)| \le a_j$.

■ Assume



mZMn A m eA m Z^jem a j Qj V'v 



sup sup ^ 4fc<A 



Assume sup meMn T,jem a j a j/ n ^ ■ B - 
SetC m = Z: jem a]a]vl. 
Set s m = sup j6ni a?Oj V 1. 

E ^(i^-)^- 1 + E sm'W-^-r i < a 

m^Mn m€.M n 



Assume 



penirn) > a 2 ~(l + « + (2 + - + /i)L m ), 
n « 



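for $\kappa > 0$, $\mu > 0$.

Then for a certain $K = K(p)$ which depends on the distribution of $\eta$, and on the
constants $A$, $B$, $\kappa$ and $\mu$,

$$E R(\hat\lambda, \hat m) \le \inf_{m \in M_n}\Big(\inf_{\lambda \in \Lambda_m} R(\lambda, m) + \mathrm{pen}(m)\Big) + CK(p)/n. \qquad (4.3)$$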

Remark 14.4.1 The assumptions over the regularizing coefficients $q_j(\lambda)$
are technical and are given in order to control fluctuations over the set $\Lambda$.

Remark 14.4.2 The inclusion of the term $L_m$ is necessary as the number of
terms in the sum over $m \in M_n$ with the same dimension might be big.

Remark 14.4.3 The contrast can be written as

$$\gamma_n(\lambda, m) = \sum_{j \in m} (q_j^2(\lambda) - 2q_j(\lambda))(y_j\sigma_j)^2 + \frac{c}{n}\sum_{j \in m} 2q_j(\lambda)\sigma_j^2, \qquad (4.4)$$

with $c \ge \sigma^2$ a given constant, as $\sigma$ may be unknown, and with $\mathrm{pen}(m)$ satisfying the
lower bound of Theorem 8 with $\sigma^2$ replaced by $c$. In this case, however, we might be
over-penalizing. If $\Lambda$ is finite and $\sup q_j(\lambda)/\inf q_j(\lambda) \le S$, then $c$, $\kappa$ and $\mu$ can
be chosen from the data (see equation (4.12)).

Remark 14.4.4 If $w_m \le C$, the bounds are as in the usual regression problem.
Moreover, if $\sup q_j(\lambda)/\inf q_j(\lambda) \le S$, $\max_{m \in M_n} d_m \le N$ and $L_m \le c$
(ordered selection), the bounds are as in [Tsybakov, 2000].

Remark 14.4.5 If $\sup q_j(\lambda)/\inf q_j(\lambda) \le S$, the penalization term is comparable
to the minimum risk, so that the estimation is efficient up to a
constant.

14.4.2. Second case

In this section we are interested in studying projection estimators, that is,
$q_{j,m}(\lambda) = 1$ if $j \in m$ and $q_{j,m}(\lambda) = 0$ if $j \in m^c$.

For a given linear operator $A : \mathbb{R}^{d_m} \to \mathbb{R}^{d_m}$, define, for $\|\cdot\|$ the usual
Euclidean norm,

$$\rho(A) = \sup_{s,\ s \ne 0} \frac{\|As\|}{\|s\|}. \qquad (4.5)$$

Let $t_m = \rho(\Sigma_m^{-1})$. This term will play an important role in the proof
of a result analogous to Theorem 8. Basically, we will assume that
$\sup_{m \in M_n} t_m \le C$. Heuristically, we can argue that as the number of
observations $x_i$, $i = 1, \ldots, n$, grows, the associated Gram matrix tends to the
identity matrix, since, as we recall, $\psi$ is an orthonormal basis for $H_2$. In fact, we
shall require for the proof a stronger condition than the one suggested above,
namely that

sup sup\E m (k,j) -S kij \ < — . (4.6) 

m€M n k,j n 

This condition, once again, can be argued by the above heuristics, as asking
that the Gram matrix be "almost" diagonal.
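
The quantities involved in this condition are easy to compute for a given design. The sketch below is an illustration under assumptions: the basis is the cosine basis on $[0, 1]$, $t_m$ is taken as the operator norm (4.5) of the inverse Gram matrix, and two designs are compared, the regular midpoint design and a hypothetical random design. The operator norm is computed as the largest singular value.

```python
import numpy as np

def cosine_basis(x, d):
    """First d functions of the cosine basis on [0, 1]: 1, sqrt(2) cos(pi j x)."""
    j = np.arange(d)
    psi = np.sqrt(2.0) * np.cos(np.pi * np.outer(x, j))
    psi[:, 0] = 1.0
    return psi

def gram_matrix(x, d):
    psi = cosine_basis(x, d)
    return psi.T @ psi / len(x)

def rho(a):
    """Operator norm sup ||A s|| / ||s||, i.e. the largest singular value of A."""
    return np.linalg.norm(a, 2)

n, d = 100, 8
regular = (np.arange(n) + 0.5) / n
rng = np.random.default_rng(2)
irregular = np.sort(rng.uniform(0.0, 1.0, n))

t_regular = rho(np.linalg.inv(gram_matrix(regular, d)))
t_irregular = rho(np.linalg.inv(gram_matrix(irregular, d)))
max_dev = np.max(np.abs(gram_matrix(irregular, d) - np.eye(d)))
print(t_regular, t_irregular, max_dev)
```

For the midpoint design the Gram matrix is the identity and the norm equals 1; for an irregular design the entrywise deviation from the identity and the norm of the inverse quantify how far the design is from the "almost diagonal" situation.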

We also require some additional notation. Set $C_m = \mathrm{tr}(A_m^{-1})$.

As before, we may consider the estimation scheme in terms of contrasts.
Let $m \in M_n$, and $g \in \mathbb{R}^{d_m}$. Set

$$T_n(g, m) = \|x_m - g\|^2, \qquad (4.7)$$

where $x_m$ is defined in Section 3. Of course, minimizing $T_n$ amounts to setting
$g_j = x_{m,j}$ and then $T_n(g) = 0$. It will be more convenient however to consider
$\gamma_n(g, m) = T_n(g, m) - \|x_m\|^2$ instead, and in this case the minimum will be
$\gamma_n(g, m) = -\sum_{j \in m} x_{m,j}^2$. Now consider

$$\gamma_n^{pen}(m) = \inf_{g \in \mathbb{R}^{d_m}} (\gamma_n(g, m) + \mathrm{pen}(m)) = -\sum_{j \in m} x_{m,j}^2 + \mathrm{pen}(m) \qquad (4.8)$$

and define

$$\hat m = \arg\min_{m \in M_n} \gamma_n^{pen}(m). \qquad (4.9)$$

If we identify $f \in H_1$ with the sequence of its Fourier coefficients over the
basis $\phi$, the estimator associated with $\gamma_n^{pen}$ will be $\hat f_{\hat m} = (x_{\hat m, j})_{j \in \hat m}$. We have the
following result:

Theorem 9 Assume $\|f\|_{H_1} \le B$. Assume (4.6) is satisfied. Set
$\mathrm{pen}(m) = \kappa\sigma^2 C_m(1 + L_m)/n$, with $\kappa > 1$. Assume $\eta$ is such that there
exists $p > 2$ with $E|\eta|^p < \infty$.
Assume £ meMn ( &a cf BL ) P ^ C m (l + L m ) < C(p). Then, 

E||/* -/!&<<<«)( inf (inf ||«-/||2 ri +pen(m)) + C(/c > p) ff 2 - 1 -^ 1 



m£M„ s€5 m naP 

(4.10) 

The proof of the last result follows very much as in [Baraud, 2000], and is 
given in the Appendix. 

Remark 14.4.6 The inclusion of term L m is in order to assure that 

Y^meMn ( £ mtm ) < C. Usually for non ordered selection over a finite 
set of possibilities this term is chosen as L m = log N (see Section 2). 

Remark 14.4.7 If $\Sigma_m$ is diagonal, the penalization can be written as
$C(1 + L_m) \sum_{j \in m} \sigma_j^2$. This case is simpler than the one considered in Theorem
8 as the problem is really discrete, so constants can be estimated. Departure
of the penalization from the one given above depends on the eigenvalues of
$\Sigma_m^{-1}$. This introduces the idea of almost diagonal estimation as described in
[Kaliffa & Mallat, 2001]. In the diagonal case, the problem is equivalent to
hard thresholding estimation [Barron, Birge & Massart, 1999], which yields
the choice of index $j$ if $x_j > \sqrt{C \sigma_j^2 \log N / n}$. These rates are optimal in the
Gaussian case [Kaliffa & Mallat, 2001].
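The hard thresholding rule mentioned in the remark can be sketched as follows. This is only an illustration under stated assumptions: the comparison is made on $|x_j|$, the constant `C` is a free tuning parameter, $N$ is taken to be the number of coefficients, and the function name is hypothetical.

```python
import numpy as np

def hard_threshold(x, sigma2, n, C=2.0):
    """Keep coefficient j only when |x_j| exceeds sqrt(C * sigma_j^2 * log(N) / n).

    x      : renormalized coefficients x_j
    sigma2 : variances sigma_j^2 attached to each coefficient
    n      : number of observations
    C      : tuning constant (assumed value, not taken from the text)
    """
    x = np.asarray(x, dtype=float)
    sigma2 = np.broadcast_to(np.asarray(sigma2, dtype=float), x.shape)
    N = x.size                                   # assumption: N = number of candidates
    threshold = np.sqrt(C * sigma2 * np.log(N) / n)
    keep = np.abs(x) > threshold
    return np.where(keep, x, 0.0), keep

estimate, selected = hard_threshold([3.0, 0.01, -2.0, 0.05], sigma2=1.0, n=100)
print(estimate, selected)
```

The selected index set plays the role of the model $m$ in the non ordered selection problem; coefficients outside it are set to zero.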

Remark 14.4.8 Assume that we look at the problem (in the diagonal case) 

$$\min_{m \in M_n} \Big[ \min \big( \|y - Af(x)\|_{(n)}^2 - \|y\|_{(n)}^2 \big) + \mathrm{pen}(m) \Big], \qquad (4.11)$$

with $\mathrm{pen}(m) = \kappa \sigma^2 L_m d_m / n$ and $L_m = 1 + \log(N)$ ($N = \max_{m \in M_n} d_m$).
As above, it can be seen that the minimum is obtained ([Barron, Birge & Mas- 
sart, 1999]) for yj > y ^M , /„ other words, x, > °j\ [^^ , 
which is the solution of the problem as defined above. It is remarked that in 
(4.11) the penalization is just as in the problem with direct observations. However,
although we can see that both problems are equivalent, we do not have
an equivalent to Theorem 9 for the contrast

$$\|y - Af(x)\|_{(n)}^2 - \|y\|_{(n)}^2 + \mathrm{pen}(m)$$

in the general case.
Remark 14.4.9 If we assume certain regularity conditions over $f$, namely
$f \in \Theta(a, Q)$, both the ordered selection and the truncated selection yield
efficient rates. In the first case, the choice of the penalty yields a quadratic
risk smaller than



mf [sup y: fi + l - £ ^ + C / n 

k>r n k 

where r n = inf{d m s.t. Ek>d m 7Z < £ EjerX) 

14.4.3. Choosing the penalty 

Following [Birge & Massart, 2001a], [Lavielle, 2001], in the ordered selection
case we can choose $\kappa$ in the penalization function from a discrete family.
Indeed, we have the following result.

Lemma 1 There exist two sequences $m_1 = 1 < m_2 < \ldots$ and
$c_0 = \infty > c_1 > \ldots$ defined by

$$c_i = \frac{\gamma_n(m_{i+1}, \hat f_{m_{i+1}}) - \gamma_n(m_i, \hat f_{m_i})}{\sum_{j=1}^{m_i} \sigma_j^2 - \sum_{j=1}^{m_{i+1}} \sigma_j^2}$$

such that

$$\forall c \in (c_i, c_{i+1}), \quad \hat m = m_i.$$

In order to choose the "correct" dimension $m$ we inspect the longest intervals
$(c_i, c_{i+1})$, in a sense the most robust as they depend less on small changes
of the penalization parameter.
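The construction in Lemma 1 can be sketched in code. The following is a minimal illustration, assuming ordered models indexed by dimension, a vector `contrast` holding the minimal contrast of each model, and a strictly increasing vector `pen_shape` holding $\sum_{j \le m} \sigma_j^2$; the names and the rule of ignoring the unbounded first interval are assumptions made for the example, not the authors' implementation.

```python
import numpy as np

def selection_path(contrast, pen_shape):
    """Return the list of (model index, c_low, c_high) such that the model
    minimizes contrast[m] + c * pen_shape[m] for every c in (c_low, c_high).

    The walk starts at the smallest model (selected for c -> infinity) and moves
    to larger models as the penalty constant c decreases.
    """
    contrast = np.asarray(contrast, dtype=float)
    pen_shape = np.asarray(pen_shape, dtype=float)
    path, current, upper = [], 0, np.inf
    while current < len(contrast) - 1:
        later = np.arange(current + 1, len(contrast))
        # Value of c at which each larger model starts to beat the current one.
        slopes = (contrast[current] - contrast[later]) / (pen_shape[later] - pen_shape[current])
        best = slopes.max()
        if best <= 0:
            break
        path.append((int(current), float(best), float(upper)))
        # On ties, jump to the largest model attaining the breakpoint.
        current = later[np.flatnonzero(slopes == best)[-1]]
        upper = best
    path.append((int(current), 0.0, float(upper)))
    return path

def longest_interval_model(path):
    """Pick the model with the longest bounded interval of penalty constants."""
    bounded = [(hi - lo, m) for m, lo, hi in path if np.isfinite(hi)]
    return max(bounded)[1] if bounded else path[0][0]

# Toy example: three large coefficients followed by three negligible ones.
x_squared = np.array([10.0, 8.0, 6.0, 0.1, 0.05, 0.02])
contrast = -np.cumsum(x_squared)
pen_shape = np.arange(1.0, 7.0)                  # sigma_j = 1 for all j (assumption)

path = selection_path(contrast, pen_shape)
chosen = longest_interval_model(path)            # zero-based index of the chosen model
print(path, chosen + 1)
```

In this toy example the interval attached to the model of dimension 3 is by far the longest bounded one, so that dimension is retained; this mirrors the jump between slowly varying and abruptly varying indices described in the numerical examples.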

14.5 Bayesian interpretation 

2 2 

Assume, r^ is Gaussian(0, ^-). That is to say each y^ is ff(Af(xi), — ). If 
we look at the likelihood of $y = (y_1, \ldots, y_n)$ given $f$ we have

$$P(y \mid f) \sim P(y \mid f, m)\, p(f \mid m)\, p(m)$$

where $\sim$ stands for proportional.

In terms of the discussion of Section 2, we have $\log(p(m)) \sim -\kappa L_m d_m / n$.
So that minimizing

$$\|y - A(f)\|_{(n)}^2 + \kappa L_m d_m / n, \qquad (5.1)$$

is equivalent to maximizing the likelihood of the observations under an improper
uniform prior $p(f \mid m)$, as suggested by Birge and Massart [Birge &
Massart 2001].

In a Bayesian framework selecting between two models is achieved by looking
at the Bayes factor (see, for example, [Han & Carlin, 2000]), that is

$$B_{ji} = \frac{P(m = j \mid y) / P(m = i \mid y)}{P(m = j) / P(m = i)}.$$

As shown in [Han & Carlin, 2000], improper priors for $s \mid m = k$ will render
improper priors for $s$ as well, so that the Bayes factors are not defined.
However, a reasonable proposition seems to be to look instead at the ratio

$$\tilde B_{ji} = \frac{\sup_{s \in m = j} P(s \mid y, m = j) / \sup_{s \in m = i} P(s \mid y, m = i)}{P(m = j) / P(m = i)}.$$

Choosing $j$ such that $\tilde B_{ji} < \tilde B_{ki}$ for $k \in M_n$ is exactly penalized estimation
as in (5.1). It is interesting to remark that in the above setting, model priors
are selected solely on the basis of their dimension: p(m) ~ f(d m ). In ordered 
selection typically logp(m)/n ~ C „ . which corresponds to a Geometric 
prior. Binomial type priors yield a heavier penalization, of order $d_m(1 +
\log(K))$, for $K = \max_{m \in M_n} d_m$, which corresponds to non ordered selection.
Poisson type priors yield penalizations of order $d_m \log d_m$.

In the ill posed case, we look at the renormalized problem associated to $x$
instead of the original problem associated to the observations $y$. This is done
in order to show that the estimation scheme is correct. The priors then become
functions of $\sum_{j \in m} \sigma_j^2$ instead of functions of the dimension $d_m$. In terms of
the contrasts, rather than looking at the discrepancy measure $\|y - Af(x)\|^2$
we look at the empirical risk function associated to a linear estimator
$h_k x_k$. That is, the contrast is chosen in such a way that its expectation is the
risk function plus a constant. Penalized versions of these contrasts must take
into account the variance of the renormalized errors.

If we consider additionally a regularization term $J(\lambda f)$, this amounts to
selecting $p(f \mid m) = J(\lambda_m f)$, that is, assuming the priors are not improper. If
$\lambda_m$ must be chosen also we obtain

$$P(y \mid f) \sim P(y \mid f, \lambda, m)\, p(f \mid \lambda, m)\, p(\lambda \mid m)\, p(m).$$

This is what is done in Section 14.4, Theorem 8.

We remark that in certain cases (see Remark 14.4.8) these penalizations are
equivalent to the ones given for the well posed problem based on the original
observations, that is, for the contrast $\|y - Af(x)\|^2$. Also see the discussion in
Section 14.6 below.
If $\lambda$ doesn't have to be chosen (other than its dimension, which is controlled
by $p(m)$), the problem amounts to minimizing

$$P(y \mid f) \sim p(y \mid f, m)\, p(f \mid m)\, p(m).$$

If $J(f)$ is a quadratic functional of the unknown $f_j$, the resulting estimator
will be linear (including projection estimators). This of course corresponds to
a Gaussian prior distribution for these coefficients. From a numerical point of
view, quadratic functionals "boost" the eigenvalues $\nu_j$ of $A_m$ by a factor of $\lambda_j$ if
$j \in m$ or by $\infty$ if $j \notin m$. Choosing the right $\lambda_j$ is thus choosing the variance
$(1/\lambda_j)$ of $f_j$ in such a way that the rates of optimal estimation are achieved.

This Bayesian point of view is also developed in [Loubes, 2001]. In this
work the author is interested in obtaining correct rates over ellipsoids of prescribed
regularity and thus chooses $\lambda_j$, $j = 1, \ldots, n$ in order to obtain optimal
rates assuming known regularity. As regularity is not known beforehand, he
must consider a prior distribution over the set of possible regularities. This
prior, $q(s)$, is again chosen in such a way as to assure convergence at optimal
rates.

In the next Section 14.6 we discuss $J(f) = \|f\|_1$ and relate this with soft
thresholding estimators as in [Kaliffa & Mallat, 2001], although in this case
we consider a uniform prior over the set of all possible models $S_m$, $m \in M_n$.

14.6 $L^1$ penalization

Consider, as in [Aluffi-Pentini et al, 1999], the problem of regularizing functionals
other than quadratic. These authors consider the problem

$$\min \|y - Af(x)\|_p + \lambda \|f\|_p, \qquad p = 1, \infty.$$

In the penalization context, the latter contrast assumes a uniform distribution
over the set of all possible $m \in M_n$, and penalizes rather on the coefficients
$f_k$ associated to the Fourier expansion of $f$ over the basis $\{\varphi\}$.

For the case $p = 1$, this problem has an interpretation in terms of soft threshold
estimators [Kaliffa & Mallat, 2001].

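As above, consider minimizing

$$\min_{m \in M_n} \Big[ \min \big( \|y - Af\|_{(n)}^2 + J(\lambda f) \big) + \mathrm{pen}(m) \Big] \qquad (6.1)$$

for a certain $\lambda$ which will be chosen below.

The solution to this problem for $\lambda$ given is ([Loubes, 2001], [Loubes & Van
de Geer, 2001])

$$\hat f_j = \begin{cases} x_j - \lambda_j \sigma_j^2 & \text{if } x_j > \lambda_j \sigma_j^2, \\ x_j + \lambda_j \sigma_j^2 & \text{if } x_j < -\lambda_j \sigma_j^2, \\ 0 & \text{if not.} \end{cases}$$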
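A soft thresholding rule of this kind can be sketched in a few lines. The snippet below is an illustration only, assuming the threshold for coefficient $j$ is $\lambda_j \sigma_j^2$; the function name and the sample values are hypothetical.

```python
import numpy as np

def soft_threshold(x, lam, sigma2):
    """Shrink each coefficient x_j towards zero by lam_j * sigma_j^2.

    Coefficients whose absolute value does not exceed the threshold are set to
    zero; the others keep their sign and lose the threshold in magnitude.
    """
    x = np.asarray(x, dtype=float)
    threshold = np.asarray(lam, dtype=float) * np.asarray(sigma2, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)

shrunk = soft_threshold([3.0, -2.0, 0.5], lam=1.0, sigma2=1.0)
print(shrunk)
```

Unlike hard thresholding, the retained coefficients are also moved towards zero, which is the signature of an $L^1$ penalty.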

Assume $f$ belongs to a set $S$ such that $\sup_{f \in S} |f_j| \le s_j$. It can be seen
[Kaliffa & Mallat, 2001; Kaliffa & Mallat, 2001a] that in order to obtain
asymptotically minimax rates, in the Gaussian case,

$$\lambda_j = \begin{cases} b_j/n & \text{if } \sigma_j \sqrt{1 + N_j}/\sqrt{n} < s_j, \\ \infty & \text{if } \sigma_j \sqrt{1 + N_j}/\sqrt{n} > s_j, \end{cases}$$

where $N_j$ is the total number of coefficients $x$ whose variances satisfy a certain
condition [Kaliffa & Mallat, 2001].

In the penalization setup, the above is equivalent to considering $\mathrm{pen}(m) =
\kappa_1 \sum_{j \in m} \log(N_j)$ and $\lambda_j = 2 b_j / n$. Penalization over the dimension thus
acts to prevent indices with big coefficients $\sigma_j \log(N_j)$ from appearing. Again, in
Bayesian terms, the prior $p(f \mid m)$ is an $L^1$ prior, weighted by $b_j / n$ in such a
way that it gives less weight to higher dimensions.

14.7 Numerical examples 

We next show some numerical results for the projection estimator for ordered
model selection. Examples are developed with the cosine basis over
[0,1] and the operator is defined by the sequence $b_j = 1/\sqrt{j}$. Noise is Gaussian
with variance one.

In each case a series of coefficients are randomly selected for a fixed order 
and then both order and coefficients are estimated from the data. The exper- 
iment is repeated for order 5, 10, 15 and 20. In each case the algorithm is 
allowed to select up to order 40. The number of observations is n = 100. 

The order is selected as discussed in Section 14.4.3. The sequence of constants
is generated as in equation (4.12) for the whole sequence $m = 1, \ldots, 40$. Then
the subsequence of local maxima is chosen and the lower extreme of the longest
interval is selected as the appropriate constant. The selected order is the index
corresponding to the selected value. Another way is to look at the sequence
of indices related to the local maxima. Typically the indices increase slowly and then
jump abruptly. The jump point is a good indicator of the order.

The figures show the original function, the observations and the reconstruction
at the selected order. In the last example, the selected order is 13 although
the correct order is 20. The reconstruction for order 20 is also given. Clearly,
the reconstruction for the chosen order is better: the ill-posedness of the operator
yields a poorer reconstruction for the correct order than for the chosen
order.
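An experiment of this type can be imitated in a simplified sequence-space form. The sketch below is not the authors' code: it assumes the model $y_j = b_j f_j + \sigma \varepsilon_j / \sqrt{n}$ with $b_j = 1/\sqrt{j}$, a penalty proportional to $\sum_{j \le m} 1/b_j^2$, and a fixed constant `kappa` in place of the data-driven choice of Section 14.4.3; all names and default values are hypothetical.

```python
import numpy as np

def simulate_order_selection(true_order, n=100, max_order=40, kappa=2.0,
                             sigma=1.0, seed=0):
    """Simulate one ordered model selection experiment in sequence space.

    Returns the selected order, the estimated coefficients and the true ones.
    """
    rng = np.random.default_rng(seed)
    j = np.arange(1, max_order + 1)
    b = 1.0 / np.sqrt(j)                          # operator sequence b_j = 1/sqrt(j)
    f = np.zeros(max_order)
    f[:true_order] = rng.normal(size=true_order)  # random coefficients up to the true order
    y = b * f + sigma * rng.normal(size=max_order) / np.sqrt(n)
    x = y / b                                     # renormalized observations
    contrast = -np.cumsum(x ** 2)                 # contrast of the nested models
    penalty = kappa * sigma ** 2 * np.cumsum(1.0 / b ** 2) / n
    selected = int(np.argmin(contrast + penalty)) + 1
    return selected, x[:selected], f

for order in (5, 10, 15, 20):
    selected, estimate, truth = simulate_order_selection(order)
    print(order, selected)
```

Because the noise variance of the renormalized coefficient $x_j$ grows with $j$, high-order coefficients are expensive to include, which is consistent with selected orders falling below the true order for the larger models.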
[Figure: "Reconstruction for selected subspace"; legend: original data, observations, reconstruction]

Figure 1. Original function, observations and reconstruction. Original order is 5, selected
order is 4



[Figure: "Reconstruction for selected subspace"]

Figure 2. Original function, observations and reconstruction. Original order is 10, selected
order is 9
[Figure: "Reconstruction for selected subspace"; legend: original data, observations, reconstruction]

Figure 3. Original function, observations and reconstruction. Original order is 15, selected
order is 14



[Figure: "Reconstruction for selected order"; legend: original function, observations, reconstruction for selected order]

Figure 4. Original function, observations and reconstruction. Original order is 20, selected
order is 13
[Figure: "Reconstruction for correct order"]

Figure 5. Comparing original function to reconstruction using the correct order. Original
order is 20 (same example as Figure 4).



14.8 Appendix 

Proof of Theorem 8:
Using standard arguments we have for any $\lambda$ and $m$,

$$R(\hat\lambda, \hat m) \le R(\lambda, m) + \mathrm{pen}(m) - \mathrm{pen}(\hat m) + \big( (\gamma_n(\lambda, m) - R(\lambda, m)) - (\gamma_n(\hat\lambda, \hat m) - R(\hat\lambda, \hat m)) \big).$$

The basic idea behind the proof is to bound (in probability) the fluctuations
of the random part of the contrast using adequate inequalities, which in our
case are Rosenthal type inequalities as in [Baraud, 2000].
Set $T(\lambda, \lambda', m, m') = (\gamma_n(\lambda, m) - R(\lambda, m)) - (\gamma_n(\lambda', m') - R(\lambda', m'))$; following
[Baraud, 2000] we shall bound this expression for all $\lambda$, $\lambda'$, $m$ and $m'$.

Let $z_k = \frac{1}{n} \sum_{i=1}^{n} \eta_i \psi_k(x_i)$, so that



T(A, A',m, m!) 

= 2 £ z k a k f k ((q k (X) 2 -2q k (X))-(q k (X) ,2 -2q k (X)')) 
k€mUm' 

+ E (*fc - ^/n)a 2 k ((q k (X) 2 - 2q k (X)) 2 - (q k (X)' 2 - 2q k (X)') 2 ) 

k€mUm' 



First we deal with $Z_1$.
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Zi = 2 ]T a k f k z k ((q k (X) - l) 2 - (g k (X)' - l) 2 ) 
k€m,r\m' 

+ 2 j; ^/ fe ^(fe(A) 2 -2 gfe (A))-( gfe (A0 2 -2 gfc (A'))) 

fc€mUm'\mnm' 
= ^11 + ^12 

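It is shown in [Tsybakov, 2000] that

$$\big( (q_k(\lambda) - 1)^2 - (q_k(\lambda') - 1)^2 \big)^2 = \big[ (q_k(\lambda) - 1) + (q_k(\lambda') - 1) \big]^2 (q_k(\lambda) - q_k(\lambda'))^2$$
$$\le 2 \big[ (q_k(\lambda) - 1)^2 + (q_k(\lambda') - 1)^2 \big] \big( q_k(\lambda)^2 + q_k(\lambda')^2 \big).$$

Also, we have $2xy \le \alpha x^2 + \frac{1}{\alpha} y^2$ for all $\alpha > 0$, $x$, $y$.
Recall also that $\sup_{m \in M_n} \sup_{\lambda \in \Lambda_m} q_k(\lambda) \le a_k$.
Thus, for $0 < \alpha < 1$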



Zu< E [«/ fe 2 [(%(A)-l) 2 + te(A')-l) 2 ] + ^4] 

regmnm' 
On the other hand, 



Zn < E WZ + 1444}. 

fcgmUm'\mnm' 
So that 



*i < a(EA a (»W-l) 2 +53/ fc a ) 

fcem fc£ro 

+ «( E /*<«(*') - !) 2 + E A') 

fcem fc€m' 

The latter term is equal to 

-orfDrnV + -oV tD m'T), 

where, for any given m, D m is the n x n matrix 
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D m = [^2 ^{xi) i)k{xe)vlal]i,e=i n- 

feem 

It is straightforward to see that 

Tr< < D m) = \ E °l*l 
fcSm 

and for || || the usual Euclidean norm over IR n , 

p{D m ) = sup Ji^il < y/Tr(D* m D m ) = (i £ <*»*) 1/a . 



x.x^O IliCH "* 



kGm 



In Corollary 5.1, [Baraud, 2000] it is shown that for any n x n matrix M 
and p > 2, 



PtfMr} > a 2 Tr(M) + 2a 2 ^/Tr{M)t + a 2 t) (8.1) 

< r(p)C(p)«- p / 2 p(Af) p_2 rr(M t M), 

where t(p) = E|r?i| p /cr p . 

Let $\alpha > 0$ and set $u = (1 + 1/\alpha) L_m \mathrm{Tr}(D_m) + x/n$. We have, for $p > 2$,

P{rfD m v > (1 + a)a 2 Tr{D m ) + a 2 u) 

< PtfDmV > a 2 Tr{D m ) + 2a 2 y/Tr(D m )u + a 2 u) 

< T(p)C(p)( rr(£> " >£>m V 2 - 

■u 

Now we bound $Z_2$. Set $\omega_j = z_j^2 - \sigma^2/n$ and call $g_j(\lambda) = \sigma_j^2 (q_j(\lambda)^2 -
2 q_j(\lambda))$.

Set 



Efcem^fCA) 
So that for any < A < 1, i = 1, 2 



6 m = max( sup — 227T\ ' *)• 

A€Am Litem ^Jfefy \ A ) 
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z 2 < lE#( A KI + lE*( A, HI 

j€m jEm' 

< A/n£ ftX)*} + A/n £ <?(A>* + £ £ «* + ^ E 

j€m jem' j€m j6m' 

< — sup ^. ^ 9y (\)aj + — — sup ^. ^ ^ (A )a,- 

n x~^ i n V s 2 

— > to? + — > wf- 

iPm ifm' 



•»? 



+ 

jem jem' 



It remains to bound the latter terms.
To begin with, set $d = \mathrm{Var}(\eta_i^2)$,

the last inequality because $\psi$ is orthonormal under $\langle \cdot, \cdot \rangle_n$.

Now set $\tilde\omega_j = \omega_j - \mathbb{E}\omega_j$. Assume $p$ is even and set $D = \mathbb{E}((\eta^2 - \sigma^2)^2 - d)^2$.
We have

j€m i=l 

Which allows us to deduce for each m £ M n 

^E-I>^E^ + ^ (8 - 2) 

jem bmn kem bmU 

< C(p)d m lf m (a 2 L m E 44 + u)~ P - 
kem 

With the above bounds we are ready to continue the proof. By our choice, 
we have pen(m) > a 2 (l + k + (2 + l/n + fj,)L m )/n. Let u < 1 be such that 
(1 + k)i//2 > 1, */(l + n) and set a = v. Let £ = (1 + k)u/2 - 1. Choose $ 
such that j3\b m = v and /?2&m' — y 

By our choice, we have pen(m) > £((C m V l)cr 2 + (6 m , n V l)D)d m /n. 
Then, 
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(1 - i/)JR(A, m) < (1 + i/)(i?(A, m) + pen(m)) 

+ 2/u[rfD m ri y - — L C m ] 

+ 2/v[rfD m r] J — -C m ] 

lb 

+ l/i/fo^rm ^ w| - o 2 —C m ] 

fc€m 

+ l/j/[6 m nn y^ w| - a 2 — C m ] 

k€m 

= (1 + i/)(R(\,m) + pen(m)) 

+ Ti (m) + Ti (m) + T 2 (m) + T 2 (m) 

Thus, if we set 

A = |(l-i/)i2(A,m)-(l + i/)( inf ( inf i?(A,m)) + pen{m))\ 

meM n AeA m 



P(A > a 2 x/n) 

< P(VmeM n Ti(m) > a 2 x/4n) + P(U me M n T 2 (m) > a 2 x/4n) 

< J2 p ( T i( m ) > a 2 */^) + Yl F ( T 2( m ) > ^V 4 ")- 

m€M n m&M n 

By (8.1) and (8.2), for any given m € M n 



P(Ti(m) > a 2 x/4n) 
and, 
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jem 

Adding up, we have 



P(A > cr 2 a;/«) 



me «„ ■(l + l/*)A»cr„ + x' 



s m 



+dm{ L C +x n 



Since for $X$ positive $\mathbb{E}X = \int_0^\infty P(X > u)\, du$, we then have that
!E[iZ(A,m)- ( , 1 + l/) ( inf ( inf i2(A,m)) +pen(m))]+ 

1 — 1/ m€M„ AeA m 

< C(p,i/)W(p) 
~ n 

which yields the desired result. 



Proof of Theorem 9:
The proof of this theorem is essentially the same as that of Theorem 8. Since there are
no weights to be chosen, the proof is actually simpler. We follow closely the
proof of Theorem 3.1 in [Baraud, 2000].

Recall $\gamma_n^{pen}(m) = -\sum_{j \in m} x_{m,j}^2 + \mathrm{pen}(m)$. Also recall that
$x_{m,j} = f_{m,j} + \zeta_{m,j}$ as defined in Section 3, where $f_{m,j}$ corresponds to the respective Fourier
coefficient of $f$ ($m$ is an index set) in terms of the orthonormal basis $\{\varphi\}$. Or,
in vector notation, $x_m = \Pi_m f + \zeta_m$, where $\Pi_m f$ is the projection of $f$ over
the subset $S_m$. Identify $f \in H_1$ with its Fourier coefficients $(f_j)_{j \ge 1} \in \ell_2$. If
$g \in \mathbb{R}^{d_m}$, for some $m \in M_n$,



| 2 -2<n m /,c m >-||Cm|| 2 
||/-n m /|| 2 -2<n m /,Cm>-||Cr, 



7n (m) = -||IWI|-2<IW><m>-||Cmii 

|2 o ^ rr t a ^ 11/- ||2 
Thus if $\hat m$ is the minimizer of $\gamma_n^{pen}$ we have, for any other $m$



so that 



11/ -/mil 2 < l|/-n m /|| 2 + 2<n m /,a>-2<n m /,Cm> 

+ ||Cm|| 2 - IKmll 2 +pen(m)-pen(m). 

Now, 



2(< IW, Cm > - < n m /, Cm >) 

= 2( 2_^ fkCm,k ~ £ fk(m,k + £ fk{Cm,k - Cm,fc)) 

fc6tn\m 46m\m fcemflm 

< °( £ /2 + £ /*) + £( £ Cm,*+ E &,*) 
(;6m\m fcSm\?fi fcEm\m fc6m\m 

+ 2 I E /fc(Cm,fc -Cm,fc)| 
fcemDm 

< «||n m / - n m /|| 2 + ^iCmll 2 + ^ICmll 2 

+2| E /fc(Cm,fc-Cm,fc)| 
fcemnm 

< a||n m /-/|| 2 + a||n m /-/|| 2 

+ -|ICm|| 2 +^||Cm|| 2 

a a 

+ 2 I E A(Cm,fc -Cm,fc)| 

feSmnm 

where $0 < \alpha < 1$. In the proof above, if $\Sigma_m$ is the identity for all $m \in M_n$,
$\zeta_{m,k} = \zeta_{m',k}$ so that the last term does not appear. Set $c = 1/\alpha + 1$, we have



(1 - a)||IW - /|| 2 < (1 + l/a)||IW - /|| 2 

+c||Cm|| 2 + c||Cm|| 2 

+2| E fk{C,rh,k - Cm,fc)| 

+pen(m) — pen(rh) 
Thus,

A = (1 - la)(||IW - ff - A-(||IW ~ ft +pen(m))) 
< cUmt 
+ c||C m |j 2 + 2| ]T /fc(Cm,fe-Cm,fc)| + cpen(m)-cpen(m) 

fcemnm 

with K=(l + l/a)/(l - a) and 
a 2 x s 



P(A > — ) 



n 



^c||Cm|| 2 + cllCmll 2 + 2\^2 A(Cm,fc - Cm,*) I + pen(m) +pen(m) > 



a 2 x 



k€m.r\m 



In order to bound this last probability we have to bound $\|\zeta_m\|^2$ and
$|\sum_{k \in m' \cap m} f_k (\zeta_{m',k} - \zeta_{m,k})|$ for any $m$, $m'$.
First, remark that for any $m$

i i 

T t Tl 

i,e k 



where i m ,*(^) = ^m 1 (V'm,j( a:; ))j6m> -Dm is the corresponding nxn matrix 
and r) is the original error vector. It is straightforward to check that Tr(D m ) 



l/nTr(A n ^). And because B m is diagonal, we have 



Tr(A^) > inf |E m (fc, ^/-((iT 1 ) 2 ) > p^^TrUB^) 2 ). 

fc€m 



On the other hand, 



P 2 (A») < TriD^Dm) 
= ^Trp-^B-VE- 1 ) 

< ^(p(E- 1 )) 2 Tr(( J B- 1 ) 4 ) 

< -^(p(E- 1 )) 2 r m Tr(( J B- 1 ) 2 ). 

ft 

Set, for any $n \times n$ matrix $M$, $v = \eta^T M \eta$. In Corollary 5.1, [Baraud, 2000] it is
shown that

P(v > a 2 tr(M)+2a 2 y/Tr(M)t+a 2 t) < Cip^^f^^p^MY^Tr^M), 
where $\tau(p) = \mathbb{E}|\eta_1|^p / \sigma^p$. Now, set $u = c_2 L_m \mathrm{Tr}(D_m) + ct/n$

P{CD m ( - pen(m) > Ct/n) 
= PiC'DmC > CiTriDm) + c 2 L m Tr(D m ) + ct/n) 
< P(C f AnC > (ci - l/a)Tr{D m ) + 2^Tr{D m )u + (1 - a)u) 

The inequalities for $\mathrm{Tr}(D_m)$ and $\rho(D_m)$ end the proof, very much as in
[Baraud, 2000] (see the proof of Theorem 8).

Second, we must bound $R = |\sum_{k \in m' \cap m} f_k (\zeta_{m',k} - \zeta_{m,k})|$. This we shall
do in several steps.

First, set $t_j = \frac{1}{n} \sum_{i=1}^{n} \psi_j(x_i) \eta_i$. Recalling the definition of $\zeta_m$, rewrite

R = 2 2_j <jS m B m n mnm // - 2 2_^ tj^ m 'B m > n m nm'/ 

j6m j€m' 

= 2 2^, *j(^m An n m nm' — ^m'^m'^nm')/ 
jemUm' 

- 2 ( 2-/ *i) II (^m^m 1 nmnm' ~ £~} JB~Jn mnm ')/|| 2 

jEmUm' 

- a X, *i + ~IK S m 1 - B m 1 n m nm' -^"/B^n^nmO/lli- 

j€mUm' 
The first term in the above sum 

jSmUm' jSm jem' 

n z n* 

with $C_m = (\sum_{j \in m} \psi_j(x_l) \psi_j(x_i))_{i,l}$. As before, we have $\mathrm{Tr}(C_m) = \mathrm{Tr}(\Sigma_m)$
and also that $\rho(C_m) \le \rho(\Sigma_m)$.
For the second term

ll( S m An n mnm ' ~ S~,B~, n mnm /)/|| 2 

^ ll( S m An n mnm' — An n mnm ')/ll2 + II An' (n mnm ' £ m , .b~, n mnm ')/|| 2 

< IKS- 1 ^- 1 - B-^IU^Vh + \\{B^^B^)Tl mnm ,)fh 

< p(^B- 1 - B-^Wfh + p(B-}H-}B-})f\\ 2 

< /3 (E m 1 )p(S m B m 1 - B- l )\\f\\ 2 + p(E m })p(E m ^- 1 - B-})\\f\\ 2 . 

By assumption, we have $|\Sigma_m(k,j) - \delta_{k,j}| = |\frac{1}{n} \sum_{i=1}^{n} \psi_k(x_i) \psi_j(x_i) -
\delta_{k,j}| \le B/n$. On the other hand, for any matrix $M = (a_{i,j})_{i,j = 1, \ldots, d}$,
$\rho(M)^2 \le \sum_{i=1}^{d} \sum_{j=1}^{d} a_{i,j}^2$. Thus,

Finally, assuming $\sup_{m \in M_n} d_m / n \le 1$

II (^m B m n mnm / - S m/J B TO , n mnm /)/|| 2 

The rest of the proof now follows as in the proof of Theorem 3.1 in [Baraud, 
2000] (pg. 484-485) (see the proof of Theorem 8). 
Abstract In this paper, we use the connection between the classic trigonometric
Caratheodory problem and the maximum entropy Burg problem for stationary
processes to obtain, from an Operator Theory point of view, Levinson's algorithm,
Schur's recursions and the Christoffel-Darboux formula. We deal with a functional
model due to Arov and Grossman, which provides a complete description
of all minimal unitary extensions of an isometry by the Schur class, in order
to describe all the solutions of the Covariance Extension Problem, and then we
obtain the density that solves the maximum entropy problem of Burg.



15.1 Introduction 

A common problem in practice is to obtain, as the result of a data collection
process in time series studies, a finite complex sequence $\{c_k\}_{k=-p}^{p}$, $p$ a natural
number, and to try to determine when such a sequence constitutes the first $p$
coefficients of a covariance function. The mathematical formulation of the problem
is: under what conditions on $\{c_k\}_{k=-p}^{p}$ is there at least one measure $\mu$ on the
unit circle such that

$$c_k = \frac{1}{2\pi} \int_0^{2\pi} e^{-ikt}\, d\mu(t), \qquad k = -p, \ldots, p. \qquad (1.1)$$

In such a case we have $c_{-k} = \overline{c_k}$, $k = -p, \ldots, p$. This
problem has a long history. In 1911, Toeplitz dealt with the case in which the data
sequence is of the form $\{c_k\}_{k=-\infty}^{\infty}$; he proved that if a solution exists then it is
unique. Problem (1.1) can be seen as a generalization of Toeplitz's problem,
but now the solution, if it exists, does not have to be unique; therefore we have
two additional problems: conditions for uniqueness and the description of
all the solutions. In 1940, Naimark studied the existence of solutions of the covariance
extension problem, using Operator Theory techniques (cf. [Sz-Nagy, 1970]).
In 1988, Dym (cf. [Dym, 1988]) and in 1989, Woerdeman (cf. [Woerdeman,
1989]) partially described the solutions. Since the solution of Problem (1.1)
is not unique, it is important to find the one which maximizes the Burg maximum
entropy functional (cf. [Burg, 1975], [Castro, 1986], [Choi, 1986] and
[Landau, 1987]) defined as

$$\frac{1}{2\pi} \int_0^{2\pi} \log f(t)\, dt,$$

where $f$ is the density of a measure $\mu$ which is a solution of Problem (1.1).

In this paper, we approach this problem from the point of view of Operator
Theory: we use Arov-Grossman's model. We associate to the given finite set
$\{c_k\}_{k=0}^{p}$ of autocorrelation coefficients of a second order centered stationary
process $X$ an isometry $V$ acting on a Hilbert space, and we prove that some
minimal unitary extension of $V$ generates a process $X$ such that the spectrum
$f$ verifies

$$\frac{1}{2\pi} \int_0^{2\pi} e^{-ikt} f(t)\, dt = c_k, \qquad k = -p, \ldots, p. \qquad (1.2)$$



We use the Arov-Grossman model (cf. [Arov, 1983]) to describe all the different
spectra $f$ of $X$ verifying (1.2). The description is given by the 1-1 correspondence
between this set and a subset of the open unit ball of $H^\infty(\mathbb{D})$,
the set of all analytic and essentially bounded functions. We use some ideas
of Marcantognini, Moran and Octavio (cf. [Marcantognini, 2000], [Marcantognini,
2001]).

Furthermore, the density which solves the maximum entropy problem corresponds
to $H = 0$.

We describe all the densities in the Wiener class that are solutions to the problem
obtained by Dym (cf. [Dym, 1988]) and Woerdeman (cf. [Woerdeman,
1989]).

The same approach is used to obtain Levinson's algorithm, Schur's algo- 
rithm and the Christoffel-Darboux formula (cf. [Arocena, 1990], [Bakonyi, 
1992], [Castro, 1986], [Foias, 1990], [Kailath, 1986], [Landau, 1987], [Schur, 
1986]). 

15.2 Notations and preliminaries 

Let $\mathbb{C}$ denote the set of complex numbers and let $\mathbb{T}$ denote the complex unit
circle $\{z \in \mathbb{C} : |z| = 1\}$, which is the boundary of $\mathbb{D} := \{z \in \mathbb{C} : |z| < 1\}$, the
open unit disk of $\mathbb{C}$. We write

$$L^\infty := \{f : \mathbb{T} \to \mathbb{C} : f \text{ is Lebesgue measurable and } \operatorname{ess\,sup} |f(\zeta)| < \infty\}$$

and $L^2 = L^2(\mathbb{T})$ denotes the square integrable (with respect to Lebesgue
measure, $dt$, on $\mathbb{T}$) Lebesgue measurable functions from $\mathbb{T}$ to $\mathbb{C}$, with the usual
norm and inner product denoted by $\|\cdot\|$ and $(\cdot,\cdot)$, respectively. Define $e_n$ by
$e_n(\zeta) := \zeta^n$, $\zeta \in \mathbb{T}$, $n \in \mathbb{Z}$, and recall that they form a complete orthonormal
basis for the Hilbert space $L^2$. As usual, $\hat f(n) = \frac{1}{2\pi} \int_0^{2\pi} e^{-int} f(e^{it})\, dt$
(respectively $\hat\mu(n) = \frac{1}{2\pi} \int_0^{2\pi} e^{-int}\, d\mu(t)$) denotes the Fourier coefficient of
the function $f$ (respectively of the finite measure $\mu$). Also, $H^\infty(\mathbb{D})$ is the
set of analytic functions $f$ on $\mathbb{D}$ such that the norm $\|f\|_\infty = \sup_{z \in \mathbb{D}} |f(z)|$
is finite. For $q = 2, \infty$, we set $H^q = H^q(\mathbb{T}) := \{f \in L^q(\mathbb{T}) : \hat f(n) = 0,\ n < 0\}$.
Finally, we recall that the Wiener algebra $W$ on the unit circle
consists of all complex valued functions $f$ on the unit circle $\mathbb{T}$ of the form
$f(\zeta) = \sum_{k=-\infty}^{\infty} \zeta^k \hat f(k)$, $\zeta \in \mathbb{T}$, where $\sum_{k=-\infty}^{\infty} |\hat f(k)| < \infty$. Let $\mathcal{E}_p$ be the
manifold spanned by $\{e_0, e_1, \ldots, e_p\}$.

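A sequence $\{c_k\}_{k=-p}^{p} \subset \mathbb{C}$ is said to be strictly positive definite if and only
if

$$\sum_{n=0}^{p} \sum_{m=0}^{p} \lambda_n \overline{\lambda_m}\, c_{n-m} > 0, \qquad \{\lambda_n\}_{n=0}^{p} \subset \mathbb{C} - \{0\}. \qquad (2.1)$$

If $\{c_k\}_{k=-p}^{p}$ is a strictly positive definite sequence of complex numbers,
we can introduce an inner product in $\mathcal{E}_p$ by setting, for $f = \sum_{j=0}^{p} a_j e_j$ and
$g = \sum_{k=0}^{p} b_k e_k$,

$$(f, g)_p = \sum_{n=0}^{p} \sum_{m=0}^{p} a_n \overline{b_m}\, c_{m-n}. \qquad (2.2)$$

As a consequence of (2.1) we have that $(\mathcal{E}_p, (\cdot,\cdot)_p)$ is a $(p+1)$-dimensional
Hilbert space. We define $\Gamma_p : (\mathcal{E}_p, (\cdot,\cdot)) \to (\mathcal{E}_p, (\cdot,\cdot))$ by

$$(\Gamma_p f, g) := (f, g)_p, \qquad f, g \in \mathcal{E}_p. \qquad (2.3)$$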
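In coordinates, the operator in (2.3) is a Hermitian Toeplitz matrix, so strict positive definiteness and the inner product (2.2) are easy to test numerically. The following is a minimal sketch, assuming the sequence is given through $c_0, \ldots, c_p$ with $c_{-k} = \overline{c_k}$; the function names and the sample sequences are illustrative only.

```python
import numpy as np

def toeplitz_gram(c):
    """Matrix with entries Gamma[m, n] = c_{m-n}, built from c_0, ..., c_p
    using c_{-k} = conj(c_k)."""
    c = np.asarray(c, dtype=complex)
    idx = np.arange(len(c))
    diff = idx[:, None] - idx[None, :]            # m - n
    return np.where(diff >= 0, c[np.abs(diff)], np.conj(c[np.abs(diff)]))

def is_strictly_positive_definite(c):
    """Check (2.1) through the eigenvalues of the Hermitian Toeplitz matrix."""
    return bool(np.all(np.linalg.eigvalsh(toeplitz_gram(c)) > 0))

def inner_p(a, b, gamma):
    """(f, g)_p = sum_n sum_m a_n conj(b_m) c_{m-n} for coefficient vectors a, b."""
    return np.conj(np.asarray(b)) @ gamma @ np.asarray(a)

good = 0.5 ** np.arange(4)                        # c_k = 0.5^k, a valid covariance sequence
bad = np.array([1.0, 2.0, 0.0])                   # |c_1| > c_0 cannot be a covariance
gamma = toeplitz_gram(good)
a = np.array([1.0, -2.0, 0.5j, 3.0])
print(is_strictly_positive_definite(good), is_strictly_positive_definite(bad))
print(inner_p(a, a, gamma).real)
```

The first sequence passes the test and the second fails it; for a valid sequence `inner_p(a, a, gamma)` is real and positive for every non-zero coefficient vector `a`.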

Clearly, $\Gamma_p$ is a linear operator and $\|\Gamma_p\| < 1$. We also conclude:

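LEMMA 15.2.1 Let $p \in \mathbb{N}$, $\{c_k\}_{k=-p}^{p}$ a strictly positive definite sequence
of complex numbers and $(\mathcal{E}_p, (\cdot,\cdot)_p)$ be the $(p+1)$-dimensional Hilbert space
defined in (2.2). Let $\mathcal{D}_p = \mathrm{Span}\{e_k\}_{k=0}^{p-1}$, $\mathcal{R}_p = \mathrm{Span}\{e_k\}_{k=1}^{p}$ be subspaces
of $\mathcal{E}_p$ and set $V_p : \mathcal{D}_p \to \mathcal{R}_p$ defined by $(V_p f)(\zeta) = \zeta f(\zeta)$, $\zeta \in \mathbb{T}$, $f \in \mathcal{D}_p$.
Then,

(a) $V_p$ is an isometry acting on the space $(\mathcal{E}_p, (\cdot,\cdot)_p)$.

(b) The orthogonal complement of $\mathcal{D}_p$, $\mathcal{N}_p = \mathcal{E}_p \ominus \mathcal{D}_p$, and the orthogonal
complement of $\mathcal{R}_p$, $\mathcal{M}_p = \mathcal{E}_p \ominus \mathcal{R}_p$, have dimension 1. Furthermore,
$\mathcal{N}_p$ and $\mathcal{M}_p$ are spanned by $n_p := \Gamma_p^{-1} e_p / \|\Gamma_p^{-1} e_p\|_p$ and $m_p := \Gamma_p^{-1} e_0 / \|\Gamma_p^{-1} e_0\|_p$,
respectively.

(c) $P_{\mathcal{N}_p} e_p = \frac{n_p}{\hat n_p(p)} = (1 - P_{\mathcal{D}_p}) e_p$, where $\hat n_p(p)$ is the $p$-th Fourier coefficient
of $n_p$.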

Proof: (a) is immediate from (2.2). In order to prove (b), we recall that the
operator $\Gamma_p$ defined in (2.3) verifies $(x, y)_p = (\Gamma_p x, y)$ and, since $(f, f)_p > 0$
if $f \in \mathcal{E}_p - \{0\}$, we have that if $f \in \mathcal{E}_p$ and $\Gamma_p f = 0$, then $(f, f)_p = 0$,
whence $f = 0$. This shows that $\Gamma_p$ is injective. Finally, since $\mathcal{E}_p$ is finite
dimensional we obtain that $\Gamma_p$ is invertible. Set $\mathcal{N}_p = \mathcal{E}_p \ominus \mathcal{D}_p$. Let $f \in \mathcal{N}_p$ and
$k \in \{0, \ldots, p-1\}$, then

$$(\Gamma_p f, e_k) = (f, e_k)_p = 0.$$

Since $\Gamma_p f \in (\mathcal{E}_p, (\cdot,\cdot))$ there exists $\lambda \in \mathbb{C}$ such that $\Gamma_p f = \lambda e_p$, so
$f = \lambda \Gamma_p^{-1} e_p$. Therefore, $\mathcal{N}_p$ is a 1-dimensional subspace of $\mathcal{E}_p$; moreover,
$\mathcal{N}_p$ is spanned by $n_p := \Gamma_p^{-1} e_p / \|\Gamma_p^{-1} e_p\|_p$, that is,

$$\mathcal{N}_p = \mathrm{Span}\{n_p\}.$$

The result concerning $\mathcal{M}_p$ can be proved in a similar fashion.
In order to prove (c), we realize that

$$\hat n_p(p) = \Big( \frac{\Gamma_p^{-1} e_p}{\|\Gamma_p^{-1} e_p\|_p}, e_p \Big) = \frac{1}{\|\Gamma_p^{-1} e_p\|_p} (\Gamma_p^{-1} e_p, \Gamma_p^{-1} e_p)_p = \|\Gamma_p^{-1} e_p\|_p$$

and therefore

$$P_{\mathcal{N}_p} e_p = (e_p, n_p)_p\, n_p = \frac{n_p}{\|\Gamma_p^{-1} e_p\|_p} = \frac{n_p}{\hat n_p(p)}.$$

□



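REMARK 1 Let $p \in \mathbb{N}$. We remark that $(\mathcal{E}_{p-1}, (\cdot,\cdot))$ is a subspace of $(\mathcal{E}_p, (\cdot,\cdot))$
and

$$P_{\mathcal{E}_{p-1}} \Gamma_p |_{\mathcal{E}_{p-1}} = \Gamma_{p-1},$$

so $\Gamma_{p-1}$ is the compression of $\Gamma_p$ to $\mathcal{E}_{p-1}$. Thus if $x, y \in \mathcal{E}_{p-1}$,

$$(x, y)_p = (\Gamma_p x, y) = (\Gamma_{p-1} x, y) = (x, y)_{p-1}.$$

Also, with the notation of lemma 15.2.1, it is easy to check that

■ $\mathcal{D}_p = \mathcal{E}_{p-1} = \mathcal{D}_{p-1} \oplus \mathcal{N}_{p-1} = \mathcal{R}_{p-1} \oplus \mathcal{M}_{p-1} = V_p \mathcal{D}_{p-1} \oplus \mathcal{M}_{p-1}$,

■ $\mathcal{R}_p = V_p \mathcal{D}_p = \mathcal{R}_{p-1} \oplus V_p \mathcal{N}_{p-1}$,

■ $\mathcal{R}_p \ominus \{V_p \mathcal{N}_{p-1}\} = \mathrm{Span}\{e_k\}_{k=1}^{p-1}$,

■ $\mathcal{E}_p = \mathcal{D}_1 \oplus \mathcal{N}_p \oplus \mathcal{N}_{p-1} \oplus \cdots \oplus \mathcal{N}_1$.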

The following lemma establishes a connection between $n_p$ and $m_p$ and also
shows where the zeros of both functions lie.

LEMMA 15.2.2 Given $p \in \mathbb{N}$, let $\{c_k\}_{k=-p}^{p}$ be a strictly positive definite
sequence of complex numbers and $(\mathcal{E}_p, (\cdot,\cdot)_p)$ the $(p+1)$-dimensional Hilbert
space defined in (2.2). If $\Gamma_p$ is the operator defined in (2.3) then

(a) $\Gamma_p^{-1} e_p = e_p \overline{\Gamma_p^{-1} e_0}$.

(b) $n_p = e_p \overline{m_p}$, that is, $n_p = \overline{\hat m_p(p)} + \overline{\hat m_p(p-1)}\, e_1 + \cdots + \overline{\hat m_p(0)}\, e_p$, where
$\hat m_p(j)$ is the $j$-th Fourier coefficient of $m_p$.

(c) All the zeros of $n_p(z)$ and $m_p(z)$ lie in $|z| < 1$ and $|z| > 1$, respectively.

Proof: First, (a) follows from the fact that $\Gamma_p^{-1} e_0 \in \mathcal{E}_p$, so it can be
written as $\Gamma_p^{-1} e_0 = \sum_{n=0}^{p} a_n e_n$, and hence $(\Gamma_p^{-1} e_p, e_k)_p = \delta_p(k) = (e_p \overline{\Gamma_p^{-1} e_0}, e_k)_p$.
(b) can be obtained as a consequence of (a) and lemma 15.2.1. Finally, let us
prove (c). Suppose $\gamma$ is a zero of $n_p$. There exists $S_{p-1} \in \mathcal{E}_{p-1}$ such that
$n_p(z) = (z - \gamma) S_{p-1}(z)$, $z \in \mathbb{C}$, or equivalently,

$$n_p(z) + \gamma S_{p-1}(z) = z S_{p-1}(z), \qquad z \in \mathbb{T}.$$

Since $n_p$ is orthogonal to $\mathcal{E}_{p-1}$, and $V_p$ is an isometry,

$$\|n_p + \gamma S_{p-1}\|_p^2 = (n_p + \gamma S_{p-1}, n_p + \gamma S_{p-1})_p = \|n_p\|_p^2 + |\gamma|^2 \|S_{p-1}\|_p^2$$

which yields

$$\|n_p\|_p^2 + |\gamma|^2 \|S_{p-1}\|_p^2 = \|V_p S_{p-1}\|_p^2 = \|S_{p-1}\|_p^2$$

whence $1 - |\gamma|^2 = 1/\|S_{p-1}\|_p^2 > 0$, as required. The result concerning the
zeros of $m_p$ can be proved in a similar fashion. □

15.3 Levinson's Algorithm and Schur's Algorithm 

Let $\mu$ be a positive finite measure on $\mathbb{T}$, and $L^2(\mu)$ be the space of all
$\mu$-measurable and square $\mu$-integrable functions. Let $\{g_k\}_{k=0}^{p}$ be the orthonormal
system obtained by applying the Gram-Schmidt process to $\{e_k\}_{k=0}^{p}$. It is a
classic result that for $k = 1, \ldots, p$, the following recurrence equations due to
Szego (cf. [Bakonyi, 1992], [Castro, 1986], [Choi, 1986], [Kailath, 1986]) are
verified:

$$q_k = e_1 q_{k-1} + \gamma_k p_{k-1}, \qquad p_k = p_{k-1} + \overline{\gamma_k}\, e_1 q_{k-1}, \qquad (3.1)$$

where $q_k$ is the monic polynomial associated to $g_k$, $p_k = e_k \overline{q_k}$, $q_0 = p_0 = 1$
and $|\gamma_k| < 1$.

The following result is analogous to (3.1).

PROPOSITION 15.3.1 For each p 6 N, p > 2 there exists 7 P G C such 
that 

n p _ C n p-i m p _i 



>; 



n p (p) 7ip_i(p-l) mp-iO-l) 

m p _ rrip-i C n y-i 

mpG 5 ) "*Pi (P - 1) P *Pi (? - 1) ' 

where jpK\ = ejeo — cTeo, mi = e\n\and all members in the formulas are 
the same as in lemma 15.2.1. Furthermore, 

(i?(p))2 . &=M (3.3) 



Proof: Following remark 1 we have the following decompositions 

p £ p p ~ p £ p\z p £ p-i p , i p £ PV P S ?~ 1 p ■, 
r V p e P — ^T> p v P^T> p -i e P-l +- r ZVP- r AA p -i e P-l 

= v P p v P :>P~i + p%t^ p k>^ 

and, 

P v p V P P U p l\ e P-i = F H-, 1 p(Vi.Vi)p-i ?1 p-i 

= (n p _i, e p _i)p_i(m p _i, l^,n p _i) p m p _i 

hence, 

-Pop 6 ? = '^■p p v v ~X e v-^ + (n p _i,e p _i) p _i{m p _i,F p n p _i) p m p _i. 

Thus, 

e P -^Dp 6 ? - "p^p-l ^p-^Dp-! 6 ?-! frTTiCp-l) ' 

which leads the desired result: 



"p(p) npi(p - 1) P rnp~^i(p ~ 1) 
where $\gamma_p = (m_{p-1}, V_p n_{p-1})_p$. The other recursion of (3.2) follows easily from
the equality $n_p = e_p \overline{m_p}$.

To obtain 7 p , let us rewrite (3.2) in the form 

n p m p -i _ C"p-i 

rhp{p) p mpi(p - 1) np~Zi(p - 1) ' 



thus 



n p m p -i n p _ m p _i 



«p(p) P ™ p -i(p - 1) ' np(p) P m p -i(p - 1) p 

_ , Vpn p -i V p n p -i . 

npl(p- l)'f^i(p-l) p 

using the fact that $V_p$ is an isometry and that $n_p$ is orthogonal to $m_{p-1}$, we find

$$\frac{1}{(\hat n_p(p))^2} = \frac{1 - |\gamma_p|^2}{(\hat n_{p-1}(p-1))^2}.$$

□

The coefficients $\gamma_p$ are called the Schur parameters; this name comes from
the classical Schur algorithm (cf. [Bakonyi, 1992], [Kailath, 1986], [Landau,
1987], [Schur, 1986]). Indeed, setting $G_p(z) = \frac{n_p(z)}{m_p(z)}$ and using Levinson's
algorithm we can rewrite $G_p(z)$ as

$$G_p(z) = \frac{z n_{p-1} - \gamma_p m_{p-1}}{m_{p-1} - \bar\gamma_p z n_{p-1}} = \frac{z G_{p-1} - \gamma_p}{1 - \bar\gamma_p z G_{p-1}}.$$
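The Schur recursion above is a Möbius transformation applied to $zG_{p-1}(z)$, so it maps the closed unit disc into itself whenever $|\gamma_p| < 1$. The following is a minimal numerical sketch of that step. The function names and the starting value $G_0 = 1$ are assumptions made for illustration; they are not taken from the text.

```python
def schur_step(g_prev: complex, z: complex, gamma: complex) -> complex:
    """One step G_p(z) = (z*G_{p-1}(z) - gamma_p) / (1 - conj(gamma_p)*z*G_{p-1}(z))."""
    w = z * g_prev
    return (w - gamma) / (1 - gamma.conjugate() * w)


def schur_iterate(z: complex, gammas: list, g0: complex = 1.0 + 0j) -> complex:
    """Apply the step for gamma_1, ..., gamma_p, starting from an assumed G_0 = g0."""
    g = g0
    for gamma in gammas:
        g = schur_step(g, z, gamma)
    return g


if __name__ == "__main__":
    # Hypothetical Schur parameters, all of modulus < 1.
    value = schur_iterate(0.3 + 0.4j, [0.5, -0.2 + 0.1j])
    print(abs(value))  # stays below 1 for |z| < 1
```

On the unit circle the step preserves modulus one, which reflects the fact that $|n_p| = |m_p|$ on $\mathbb{T}$.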

15.4 The Christoffel-Darboux formula 

If $P \in \mathcal{E}_p$ then $P$ is a polynomial function and we can evaluate $P(z)$
for $z \in \mathbb{D}$. Let $z \in \mathbb{D}$ and consider $f_z : (\mathcal{E}_p, \langle\cdot,\cdot\rangle_p) \to \mathbb{C}$, defined by
$f_z(P) = P(z)$. Clearly, $f_z$ is a linear functional. The next proposition shows
that $f_z$ is continuous, which implies that $(\mathcal{E}_p, \langle\cdot,\cdot\rangle_p)$ can be considered a
reproducing kernel space.

PROPOSITION 15.4. 1 Let z E D and E* = T" 1 ££ =0 z fc e fe rten 

$$\langle P, E_z^p\rangle_p = P(z), \qquad E_z^p = \sum_{k=0}^{p} \langle E_z^p, n_k\rangle_p\, n_k, \qquad (4.1)$$

whererio — inpjp andn k , k = 1, . . . ,pare as in lemma 15.2.1. 

Proof: If / € L 2 we have ( rrfc^ eo) = Efclo zfc /( fc )- Tt is eas y to cneck 
that 

P(z) = (P,rf52z k e k ) p and \P{z)\ < ||P|| P ||^|| P 

fc=0 
which shows that the linear functional that associates to $P \in \mathcal{E}_p$ its value at
$z \in \mathbb{D}$ is continuous. The second equation is an easy consequence of the fact
that $\{n_0, n_1, \dots, n_p\}$ is an orthonormal system for $\mathcal{E}_p$. □

As a consequence of Proposition 15.4.1, we have: 

THEOREM 15.4.2 (The Christoffel-Darboux formula)
If $z, \xi \in \mathbb{D}$ then

$$E_\xi^p(z) = \frac{\overline{m_p(\xi)}\, m_p(z) - \bar\xi\, \overline{n_p(\xi)}\, z\, n_p(z)}{1 - \bar\xi z},$$

where $n_p$, $m_p$ are defined as in lemma 15.2.1 and $E_\xi^p$ is defined as in
proposition 15.4.1.
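As a sanity check of the Christoffel-Darboux formula, the sketch below treats the simplest case, where the measure is normalized Lebesgue measure on $\mathbb{T}$. In that case (an assumption of this example, not a statement from the text) the exponentials $e_k$ are already orthonormal, so one can take $m_p = 1$, $n_p(z) = z^p$, and the kernel is $E_\xi^p(z) = \sum_{k=0}^{p} \bar\xi^k z^k$. All function names are hypothetical.

```python
def kernel_sum(z: complex, xi: complex, p: int) -> complex:
    """Reproducing kernel as a finite sum: sum_{k=0}^{p} conj(xi)^k z^k."""
    w = xi.conjugate() * z
    return sum(w ** k for k in range(p + 1))


def kernel_closed_form(z: complex, xi: complex, p: int) -> complex:
    """Christoffel-Darboux closed form with m_p = 1 and n_p(z) = z^p (Lebesgue case)."""
    m_z, m_xi = 1.0 + 0j, 1.0 + 0j
    n_z, n_xi = z ** p, xi ** p
    numerator = m_xi.conjugate() * m_z - xi.conjugate() * n_xi.conjugate() * z * n_z
    return numerator / (1 - xi.conjugate() * z)


def evaluate_via_kernel(coeffs: list, z: complex) -> complex:
    """<P, E_z> for P = sum a_k e_k when <e_j, e_k> = delta_jk; should equal P(z)."""
    return sum(a * (z.conjugate() ** k).conjugate() for k, a in enumerate(coeffs))


if __name__ == "__main__":
    z, xi, p = 0.3 + 0.4j, -0.2 + 0.5j, 5
    print(abs(kernel_sum(z, xi, p) - kernel_closed_form(z, xi, p)))  # close to zero
```

Both expressions reduce to the geometric sum $(1 - (\bar\xi z)^{p+1})/(1 - \bar\xi z)$, which is why they agree.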

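Proof: Let $Q \in \mathcal{E}_{p-1}$ and $z, \xi \in \mathbb{D}$. Using the definition of $E_\xi^p$ and the fact
that $V_{p+1}$ is an isometry, we obtain that

$$0 = \langle (e_1 - \xi) Q, E_\xi^p\rangle_p = \langle e_1 Q, (1 - \bar\xi e_1) E_\xi^p\rangle_{p+1}.$$

Using the fact that $\mathcal{R}_p = \mathrm{Span}\{e_1 Q : Q \in \mathcal{E}_{p-1}\}$ we obtain
$(1 - \bar\xi e_1) E_\xi^p \in \mathcal{E}_{p+1} \ominus \mathcal{R}_p$, the orthogonal complement of the subspace $\mathcal{R}_p$ with
respect to the (p + 1)-dimensional space $\mathcal{E}_{p+1}$. On the other hand,
$e_1 n_p, E_0^p \in \mathcal{E}_{p+1} \ominus \mathcal{R}_p$ and since both polynomials have different degree they
generate the at most 2-dimensional space $\mathcal{E}_{p+1} \ominus \mathcal{R}_p$. Therefore,

$$(1 - \bar\xi e_1) E_\xi^p = a E_0^p + b\, e_1 n_p.$$

By (4.1), $b = -\bar\xi\, \overline{n_p(\xi)}$ and also,

$$(1 - \bar\xi e_1) E_\xi^p = a E_0^p - \bar\xi\, \overline{n_p(\xi)}\, e_1 n_p,$$

which yields

$$(1 - \bar\xi w) E_\xi^p(w) = a E_0^p(w) - \bar\xi\, \overline{n_p(\xi)}\, w\, n_p(w), \quad w \in \mathbb{C}.$$

The desired result comes easily from the fact $n_p = e_p \overline{m_p}$. □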

15.5 Description of all spectrums of a stationary process 

The main result of this section is the description of the set of all measures
$\mu$ absolutely continuous with respect to Lebesgue measure on $\mathbb{T}$ and such that
$\hat\mu(k) = c_k$, $k = -p, \dots, p$, where $\{c_k\}_{k=-p}^{p}$ is a given strictly positive
definite complex sequence.

We require some notions of Harmonic Analysis of Operators on Hilbert 
Spaces (cf. [Sz-Nagy, 1970]). 
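Let $\mathcal{H}$ be a Hilbert space, $\mathcal{D}$, $\mathcal{R}$ two closed subspaces of $\mathcal{H}$ and $V : \mathcal{D} \to \mathcal{R}$
an isometry acting on $\mathcal{H}$. We say that a unitary operator $U$ acting on a Hilbert
space $\mathcal{F}$ is a unitary extension of the isometry $V$ if and only if $\mathcal{H}$ is a closed
subspace of $\mathcal{F}$ and $U|_{\mathcal{D}} = V$. If in addition, $\mathcal{F} = \bigvee_{n \in \mathbb{Z}} U^n(\mathcal{H})$, we say
that $U$ is a minimal unitary extension of $V$. We identify two minimal unitary
extensions of $V$, $U$ and $U'$, acting respectively on the Hilbert spaces $\mathcal{F}$
and $\mathcal{F}'$, if and only if there exists a unitary operator $\Phi : \mathcal{F} \to \mathcal{F}'$ such that
$\Phi|_{\mathcal{H}} = I_{\mathcal{H}}$ and $\Phi U = U'\Phi$.

Let $\mathcal{N}$, $\mathcal{M}$ be two closed subspaces of the Hilbert space $\mathcal{H}$; $\mathcal{L}(\mathcal{N}, \mathcal{M})$ denotes as
usual the set of all bounded linear operators from $\mathcal{N}$ to $\mathcal{M}$. An operator valued
function $\Theta : \mathbb{D} \to \mathcal{L}(\mathcal{N}, \mathcal{M})$ is a contractive analytic function if and only if
$\sup_{z \in \mathbb{D}} \|\Theta(z)\| \le 1$ and there exists a sequence $\{\Theta_k\}_{k \ge 0} \subset \mathcal{L}(\mathcal{N}, \mathcal{M})$ such that
$\Theta(z) = \sum_{k \ge 0} z^k \Theta_k$, $z \in \mathbb{D}$, where the convergence is in the operator norm.
The Schur class $S(\mathcal{N}, \mathcal{M})$ is the set of all contractive analytic functions
$\Theta : \mathbb{D} \to \mathcal{L}(\mathcal{N}, \mathcal{M})$.

The Arov and Grossman functional model (cf. [Arov, 1983], [Marcantognini, 2000])
establishes the existence of a bijection between the unitary extensions of an
isometry $V : \mathcal{D} \to \mathcal{R}$ acting on $\mathcal{H}$, indistinguishable from the geometric
point of view, and the Schur class $S(\mathcal{N}, \mathcal{M})$, where $\mathcal{N}$, $\mathcal{M}$ are the defect
spaces of $V$. Given $U \in \mathcal{L}(\mathcal{F})$ a minimal unitary extension of the isometry $V$,
$\Theta : \mathbb{D} \to \mathcal{L}(\mathcal{N}, \mathcal{M})$ defined by

$$\Theta(z) = P_{\mathcal{M}}\, U \left(I - z P_{\mathcal{F} \ominus \mathcal{H}}\, U\right)^{-1}\big|_{\mathcal{N}}, \quad z \in \mathbb{D},$$

is a function in the Schur class, and the relation is bijective. When $U$ and $\Theta$
are related as above, we denote $U = U_\Theta$ and $\mathcal{F} = \mathcal{F}_\Theta$.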

We use this theory for the particular case when $\mathcal{H} = (\mathcal{E}_p, \langle\cdot,\cdot\rangle_p)$, $\mathcal{D} = \mathcal{D}_p$,
$\mathcal{R} = \mathcal{R}_p$, $V = V_p$, $\mathcal{M} = \mathcal{M}_p$ and $\mathcal{N} = \mathcal{N}_p$. We recall that in this case $\mathcal{M}_p$ and
$\mathcal{N}_p$ are 1-dimensional subspaces of $\mathcal{E}_p$ and therefore there exists a bijection
between the Schur class $S(\mathcal{M}_p, \mathcal{N}_p)$ and the closed unit ball of $H^\infty(\mathbb{D})$.
On the other hand, $U \in \mathcal{L}(\mathcal{F})$ is a minimal unitary extension of $V_p$ if and only
if $U^{-1} \in \mathcal{L}(\mathcal{F})$ is a minimal unitary extension of $V_p^*$. Consequently, there
exists a one to one correspondence between the minimal unitary extensions $U^{-1}$
of the isometry $V_p^*$ and the functions $H$ of $H^\infty(\mathbb{D})$ such that $\|H\|_\infty \le 1$. In
order to recall the relation between a fixed $H$ and a minimal unitary extension
$U^{-1} \in \mathcal{L}(\mathcal{F})$ we set $U^{-1} = U_H^{-1}$ and $\mathcal{F} = \mathcal{F}_H$.

LEMMA 15.5.1 For each $p \in \mathbb{N}$ and $P \in (\mathcal{E}_p, \langle\cdot,\cdot\rangle_p)$,

$$\left\langle (I - zV_p^* P_{\mathcal{R}_p})^{-1} P,\; m_p \right\rangle_p = \frac{P(z)}{m_p(z)},$$

where $V_p$ is the isometry and $m_p$ is the function given in lemma 15.2.1.

Proof: Let $Q := Q(z, \zeta) = (I - zV_p^* P_{\mathcal{R}_p})^{-1} P$.
Then $Q - zV_p^* P_{\mathcal{R}_p} Q = P$, thus

$$Q - z e_{-1} Q + z e_{-1} \langle Q, m_p\rangle_p\, m_p = P$$

and

$$Q = \frac{P - z e_{-1} \langle Q, m_p\rangle_p\, m_p}{1 - z e_{-1}}.$$

The result follows from {Q, m p ) p = ip/A and 

= P(z) - (Q,m p ) p m p (z) + (Q,m p ) p m p (0). 

□ 
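We will use the following lemma; the proof can be found in [Arov, 1983] or
[Marcantognini, 2000].

LEMMA 15.5.2 Given $H \in H^\infty(\mathbb{D})$ such that $\|H\|_\infty \le 1$. If $U_H^{-1} \in \mathcal{L}(\mathcal{F}_H)$
is the minimal unitary extension of $V_p^* : \mathcal{R}_p \to \mathcal{D}_p$ related to $H$ then,

$$P_{\mathcal{E}_p}^{\mathcal{F}_H} (I - zU_H^{-1})^{-1}\big|_{\mathcal{E}_p} = \left(I - z\left(V_p^* P_{\mathcal{R}_p} + \beta_H(z) P_{\mathcal{M}_p}\right)\right)^{-1}.$$

The following lemma establishes a useful relation between $\beta_H(z)$ and $H$.

LEMMA 15.5.3 If $H \in S(\mathcal{M}_p, \mathcal{N}_p)$, then

$$\left(I - z\beta_H(z) P_{\mathcal{M}_p} (I - zV_p^* P_{\mathcal{R}_p})^{-1}\right)^{-1} e_0 = e_0 + \frac{zH(z)}{m_p(z) - zH(z) n_p(z)}\, n_p.$$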

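Proof: We use that if $A, B \in \mathcal{L}(\mathcal{F})$ with $\|A\|, \|B\| < 1$ and $\|A + B\| < 1$
then

$$(I - (A + B))^{-1} = (I - A)^{-1}\left(I - B(I - A)^{-1}\right)^{-1}, \qquad (5.1)$$

to check that

$$Q_H := Q_H(z, \zeta) = \left(I - z\beta_H(z) P_{\mathcal{M}_p} (I - zV_p^* P_{\mathcal{R}_p})^{-1}\right)^{-1} e_0$$

exists. Hence

$$Q_H - z\beta_H(z) P_{\mathcal{M}_p} (I - zV_p^* P_{\mathcal{R}_p})^{-1} Q_H = e_0,$$

thus, we obtain

$$Q_H - zH(z) \left\langle (I - zV_p^* P_{\mathcal{R}_p})^{-1} Q_H,\; m_p \right\rangle n_p = e_0. \qquad (5.2)$$

Let

$$d = \left\langle (I - zV_p^* P_{\mathcal{R}_p})^{-1} Q_H,\; m_p \right\rangle;$$

therefore, if we apply to both members of equality (5.2) the operator
$(I - zV_p^* P_{\mathcal{R}_p})^{-1}$ and take the scalar product with $m_p$, we obtain

$$d - zH(z)\, d \left\langle (I - zV_p^* P_{\mathcal{R}_p})^{-1} n_p,\; m_p \right\rangle = \left\langle (I - zV_p^* P_{\mathcal{R}_p})^{-1} e_0,\; m_p \right\rangle.$$

From lemma 15.5.1 we have

$$d - zH(z)\, d\, \frac{n_p(z)}{m_p(z)} = \frac{e_0(z)}{m_p(z)} = \frac{1}{m_p(z)}.$$

Therefore $d\,[m_p(z) - zH(z) n_p(z)] = 1$ and

$$m_p(z) - zH(z) n_p(z) \ne 0, \qquad (5.3)$$

thus

$$d = \frac{1}{m_p(z) - zH(z) n_p(z)}.$$

The result follows easily. □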
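The operator identity (5.1) used in the proof is purely algebraic: it follows from $(I - B(I-A)^{-1})(I - A) = I - A - B$. The sketch below checks it numerically on $2 \times 2$ complex matrices. The helper names and the sample matrices are hypothetical and serve only as an illustration.

```python
IDENTITY = [[1.0 + 0j, 0j], [0j, 1.0 + 0j]]


def mat_add(x, y):
    return [[x[i][j] + y[i][j] for j in range(2)] for i in range(2)]


def mat_sub(x, y):
    return [[x[i][j] - y[i][j] for j in range(2)] for i in range(2)]


def mat_mul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def mat_inv(x):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    det = x[0][0] * x[1][1] - x[0][1] * x[1][0]
    return [[x[1][1] / det, -x[0][1] / det], [-x[1][0] / det, x[0][0] / det]]


def resolvent_direct(a, b):
    """(I - (A + B))^{-1}."""
    return mat_inv(mat_sub(IDENTITY, mat_add(a, b)))


def resolvent_factored(a, b):
    """(I - A)^{-1} (I - B (I - A)^{-1})^{-1}, the right-hand side of (5.1)."""
    r = mat_inv(mat_sub(IDENTITY, a))
    return mat_mul(r, mat_inv(mat_sub(IDENTITY, mat_mul(b, r))))


def max_abs_diff(x, y):
    return max(abs(x[i][j] - y[i][j]) for i in range(2) for j in range(2))


if __name__ == "__main__":
    a = [[0.2 + 0j, 0.1j], [0j, 0.3 + 0j]]
    b = [[0.1 + 0j, 0j], [0.2 + 0j, -0.1 + 0j]]
    print(max_abs_diff(resolvent_direct(a, b), resolvent_factored(a, b)))
```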

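As seen in the previous lemma, the case $H = 0$ is simpler than the other cases. We
study such a case in the next proposition.

PROPOSITION 15.5.4 If $\mu^0$ is the spectral measure related to $U_0$, the
minimal unitary extension of $V_p$ associated to $H = 0$, then

$$d\mu_0(t) = \frac{dt}{|m_p(e^{it})|^2},$$

where $\mu_0(s) = \langle \mu^0(s) e_0, e_0\rangle$, $s \in [0, 2\pi]$.

Proof: As a consequence of lemma 15.5.1, lemma 15.5.2 and the Spectral
Theorem we have that

$$1 = \left\langle (I - zV_p^* P_{\mathcal{R}_p})^{-1} m_p, m_p \right\rangle = \left\langle (I - zU_0^{-1})^{-1} m_p, m_p \right\rangle_{\mathcal{F}_0}$$

$$= \frac{1}{2\pi}\int_0^{2\pi} \frac{1}{1 - ze^{-it}}\, d\langle \mu^0(t) m_p, m_p\rangle = \frac{1}{2\pi}\int_0^{2\pi} \frac{1}{1 - ze^{-it}}\, |m_p(e^{it})|^2\, d\langle \mu^0(t) e_0, e_0\rangle.$$

Whence $\frac{1}{2\pi}\int_0^{2\pi} e^{-ikt} |m_p(e^{it})|^2\, d\langle \mu^0(t) e_0, e_0\rangle = \delta_0(k)$ and therefore
$|m_p(e^{it})|^2\, d\langle \mu^0(t) e_0, e_0\rangle = dt$.

□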
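The next proposition shows that there are some spectral measures of
$U_H \in \mathcal{L}(\mathcal{F}_H)$, the minimal unitary extension of $V_p$, that are absolutely
continuous with respect to $dt$.

PROPOSITION 15.5.5 Given $H \in H^\infty(\mathbb{D})$ such that $\|H\|_\infty \le 1$, let
$\mu^H$ be the spectral measure of $U_H \in \mathcal{L}(\mathcal{F}_H)$, the minimal unitary extension of
$V_p : \mathcal{D}_p \to \mathcal{R}_p$ associated to $H$. If $\mu_H(t) = \langle \mu^H(t) e_0, e_0\rangle$ then $\mu_H$ verifies:

$$\frac{1}{2\pi}\int_0^{2\pi} \frac{e^{it} + z}{e^{it} - z}\, d\mu_H(t) = \frac{2zH(z)}{m_p(z) - zH(z) n_p(z)}\, \frac{z^p}{m_p(z)} + \frac{1}{2\pi}\int_0^{2\pi} \frac{e^{it} + z}{e^{it} - z}\, \frac{dt}{|m_p(e^{it})|^2}.$$

Proof: Let $A_H(z) = \left\langle (I + zU_H^{-1})(I - zU_H^{-1})^{-1} e_0, e_0 \right\rangle$; then by lemma
15.5.2 we have that

$$A_H(z) = 2\left\langle (I - zU_H^{-1})^{-1} e_0, e_0 \right\rangle - \langle e_0, e_0\rangle$$

$$= 2\left\langle \left(I - z\left(V_p^* P_{\mathcal{R}_p} + \beta_H(z) P_{\mathcal{M}_p}\right)\right)^{-1} e_0, e_0 \right\rangle - \langle e_0, e_0\rangle.$$

We recall (5.1) and we obtain

$$A_H(z) = 2\left\langle (I - zV_p^* P_{\mathcal{R}_p})^{-1} \left(I - z\beta_H(z) P_{\mathcal{M}_p} (I - zV_p^* P_{\mathcal{R}_p})^{-1}\right)^{-1} e_0, e_0 \right\rangle - \langle e_0, e_0\rangle.$$

From lemma 15.5.3 and the Spectral Theorem we conclude that

$$A_H(z) = 2\left\langle (I - zV_p^* P_{\mathcal{R}_p})^{-1} \Big(e_0 + \frac{zH(z)}{m_p(z) - zH(z) n_p(z)}\, n_p\Big), e_0 \right\rangle - \langle e_0, e_0\rangle$$

$$= 2\left\langle (I - zU_0^{-1})^{-1} \Big(e_0 + \frac{zH(z)}{m_p(z) - zH(z) n_p(z)}\, n_p\Big), e_0 \right\rangle_{\mathcal{F}_0} - \langle e_0, e_0\rangle$$

$$= \frac{1}{2\pi}\int_0^{2\pi} \Big(\frac{2}{1 - ze^{-it}} - 1\Big) \frac{dt}{|m_p(e^{it})|^2} + \frac{2zH(z)}{m_p(z) - zH(z) n_p(z)}\, \frac{1}{2\pi}\int_0^{2\pi} \frac{1}{1 - ze^{-it}}\, \frac{n_p(e^{it})}{|m_p(e^{it})|^2}\, dt$$

$$= \frac{1}{2\pi}\int_0^{2\pi} \frac{e^{it} + z}{e^{it} - z}\, \frac{dt}{|m_p(e^{it})|^2} + \frac{2zH(z)}{m_p(z) - zH(z) n_p(z)}\, \frac{z^p}{m_p(z)}.$$

□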

The following corollary gives a necessary and sufficient condition in order
that $\mu_H$ be absolutely continuous with respect to Lebesgue measure.

COROLLARY 15.5.6 Given $H \in H^\infty(\mathbb{D})$ with $\|H\|_\infty \le 1$, let $\mu_H$ be the
measure defined in the previous proposition. The measure $\mu_H$ is absolutely
continuous with respect to the Lebesgue measure on $\mathbb{T}$, with density $f_H$ given
by

$$f_H(\zeta) = \frac{1}{|m_p(\zeta)|^2}\, \mathrm{Re}\left[\frac{m_p(\zeta) + \zeta H(\zeta) n_p(\zeta)}{m_p(\zeta) - \zeta H(\zeta) n_p(\zeta)}\right],$$

if and only if the set $\{\zeta \in \mathbb{T} : |H(\zeta)| = 1\}$ has Lebesgue measure zero and
$1/(m_p - \zeta H n_p) \in L^2$.

Furthermore, if $H$ is inner then $\mu_H$ is singular with respect to Lebesgue
measure on $\mathbb{T}$.

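Proof: Let $H \in H^\infty(\mathbb{D})$ with $\|H\|_\infty \le 1$. We know from (5.3) that
$m_p(z) - zH(z) n_p(z) \ne 0$, $z \in \mathbb{D}$, and then the set

$$\{\zeta \in \mathbb{T} : m_p(\zeta) - \zeta H(\zeta) n_p(\zeta) = 0\}$$

has Lebesgue measure zero. Therefore, from the last proposition

$$\lim_{z \to \zeta} \frac{1}{2\pi}\int_0^{2\pi} \mathrm{Re}\, \frac{e^{it} + z}{e^{it} - z}\, d\mu_H(t) = \mathrm{Re}\left[\frac{2H(\zeta)\zeta^{p+1}}{m_p(\zeta)\,[m_p(\zeta) - \zeta H(\zeta) n_p(\zeta)]}\right] + \frac{1}{|m_p(\zeta)|^2}$$

$$= \frac{1}{|m_p(\zeta)|^2}\, \mathrm{Re}\left[\frac{m_p(\zeta) + \zeta H(\zeta) n_p(\zeta)}{m_p(\zeta) - \zeta H(\zeta) n_p(\zeta)}\right] = \frac{1 - |H(\zeta)|^2}{|m_p(\zeta) - \zeta H(\zeta) n_p(\zeta)|^2} = f_H(\zeta) \quad \text{a.e.}$$

From the fact $1 - \|H\|_\infty^2 \le 1 - |H|^2 \le 1$, we obtain $f_H \in L^1$ if and only if
$\frac{1}{m_p - \zeta H n_p} \in L^2$; in fact,

$$\frac{1 - \|H\|_\infty^2}{|m_p - \zeta H n_p|^2} \le f_H \le \frac{1}{|m_p - \zeta H n_p|^2}.$$

Moreover, $f_H$ is positive a.e. if and only if the set $\{\zeta \in \mathbb{T} : |H(\zeta)| = 1\}$ has
Lebesgue measure zero.

If $H$ is inner, let

$$M(z) := e^{-A_H(z)}.$$

Since $\mathrm{Re}\left[\frac{m_p(\zeta) + \zeta H(\zeta) n_p(\zeta)}{m_p(\zeta) - \zeta H(\zeta) n_p(\zeta)}\right] = 0$ a.e., it results

$$\lim_{z \to \zeta} |M(z)| = 1$$

and so $\mu_H$ is singular with respect to Lebesgue measure on $\mathbb{T}$ (cf. [Rudin, 1979]).

□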

REMARK 2 If $|H| \le a < 1$ a.e., then it is very easy to check that the Lebesgue
measure of the set $\{\zeta \in \mathbb{T} : |H(\zeta)| = 1\}$ is null and $1/(m_p - \zeta H n_p) \in L^2$.

15.6 On covariance's extension problem 

First, we state the covariance extension problem: Given $p \in \mathbb{N}$, and
$c_0, c_1, \dots, c_p$, complex numbers with $c_0 > 0$ and $c_{-k} = \overline{c_k}$, $k = 1, \dots, p$,
find a nonnegative finite measure $\mu$ on $\mathbb{T}$ such that

$$\frac{1}{2\pi}\int_0^{2\pi} e^{-ikt}\, d\mu(t) = c_k, \quad k = -p, \dots, p. \qquad (6.1)$$

The following proposition gives the conditions on $c_0, c_1, \dots, c_p$ in order that
there exists a nonnegative finite measure $\mu$ on $\mathbb{T}$ such that (6.1) is satisfied.

PROPOSITION 15.6.1 Let $p \in \mathbb{N}$. If $\{c_k\}_{k=-p}^{p} \subset \mathbb{C}$ with $c_0 > 0$ and
$c_{-k} = \overline{c_k}$ is such that there exists $\mu$, a positive measure absolutely continuous
with respect to Lebesgue measure on $\mathbb{T}$ with $c_k = \hat\mu(k)$, $k = -p, \dots, p$,
then $\{c_k\}_{k=-p}^{p}$ is a strictly positive definite sequence.

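Proof: Since there exists $\mu$, a positive measure absolutely continuous with
respect to Lebesgue measure on $\mathbb{T}$ such that $c_k = \hat\mu(k)$, $k = -p, \dots, p$, for
$\{\lambda_n\}_{n=0}^{p} \subset \mathbb{C}$ we have

$$\sum_{n=0}^{p}\sum_{m=0}^{p} \lambda_n \bar\lambda_m c_{m-n} = \frac{1}{2\pi}\sum_{n=0}^{p}\sum_{m=0}^{p} \lambda_n \bar\lambda_m \int_0^{2\pi} e^{-i(m-n)t}\, d\mu(t) = \frac{1}{2\pi}\int_0^{2\pi} \Big|\sum_{n=0}^{p} \lambda_n e^{int}\Big|^2 d\mu(t) \ge 0.$$

On the other hand, denote $\frac{d\mu}{dt} = f$, where $f$ is a positive Lebesgue integrable
function, let $\{\lambda_n\}_{n=0}^{p} \subset \mathbb{C} - \{0\}$ and assume

$$0 = \sum_{n=0}^{p}\sum_{m=0}^{p} \lambda_n \bar\lambda_m c_{m-n} = \frac{1}{2\pi}\int_0^{2\pi} \Big|\sum_{n=0}^{p} \lambda_n e^{int}\Big|^2 f(e^{it})\, dt.$$

If $Q(e^{it}) := \sum_{n=0}^{p} \lambda_n e^{int}$ then $|Q(\zeta)|^2 f(\zeta) = 0$ a.e., that is, there exists
a Lebesgue measurable set $A$ such that $|A| = 1$ (where $|A|$ denotes the
Lebesgue measure of $A$) and $|Q(\zeta)|^2 f(\zeta) = 0$, $\zeta \in A$. Let
$X = \{\zeta \in \mathbb{T} : Q(\zeta) = 0\}$; if $Q$ is not the null polynomial then $|X| = 0$
and so $f(\zeta) = 0$, $\zeta \in A - X$, with $|A - X| = 1$; then $0 < c_0 = \hat\mu(0) =
\hat f(0) = 0$, a contradiction. □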
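Strict positive definiteness of a finite sequence $c_0, \dots, c_p$ is equivalent to positive definiteness of the Hermitian Toeplitz matrix with entries $c_{j-k}$, and this can be tested numerically by attempting a Cholesky factorization. The sketch below does so in plain Python. The function name and the tolerance are assumptions of this example; only $c_0, \dots, c_p$ are passed, with $c_{-k} = \overline{c_k}$ implied.

```python
def is_strictly_positive_definite(c, tol=1e-12):
    """Return True if the Hermitian Toeplitz matrix T[j][k] = c_{j-k} admits a
    Cholesky factorization T = L L^H with strictly positive pivots."""
    n = len(c)

    def entry(j, k):
        d = j - k
        return complex(c[d]) if d >= 0 else complex(c[-d]).conjugate()

    lower = [[0j] * n for _ in range(n)]
    for j in range(n):
        # Pivot: T[j][j] minus the squared norm of the already computed row part.
        pivot = entry(j, j).real - sum(abs(lower[j][k]) ** 2 for k in range(j))
        if pivot <= tol:
            return False
        lower[j][j] = complex(pivot ** 0.5)
        for i in range(j + 1, n):
            s = entry(i, j) - sum(lower[i][k] * lower[j][k].conjugate() for k in range(j))
            lower[i][j] = s / lower[j][j]
    return True


if __name__ == "__main__":
    print(is_strictly_positive_definite([1.0, 0.5, 0.25]))  # geometric autocorrelations
    print(is_strictly_positive_definite([1.0, 1.0, 1.0]))   # rank-one, only semidefinite
```

The second sequence corresponds to a point mass at $t = 0$, which is not absolutely continuous, in agreement with the proposition.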

The main result of this section is the following theorem. It is important,
since it characterizes a strictly positive definite sequence as a finite number of
Fourier coefficients of a measure $\mu$ which is absolutely continuous with
respect to Lebesgue measure. Moreover, the theorem also gives the
Radon-Nikodym derivative of $\mu$, establishing a 1-1 correspondence between the
densities and a subset of the open unit ball of $H^\infty(\mathbb{D})$. Finally, we present a
factorization formula.

THEOREM 15.6.2 Let $p \in \mathbb{N}$ and $\{c_n\}_{n=-p}^{p} \subset \mathbb{C}$; the following
conditions are equivalent:

i) $\sum_{n=0}^{p}\sum_{m=0}^{p} \lambda_n \bar\lambda_m c_{m-n} > 0$, $\{\lambda_n\}_{n=0}^{p} \subset \mathbb{C} - \{0\}$,

ii) There exists a positive Lebesgue integrable function $f$ on $\mathbb{T}$ such that
$c_k = \hat f(k)$, $k = -p, \dots, p$.

Moreover, given $H \in H^\infty(\mathbb{D})$ such that $\|H\|_\infty \le 1$, the set $\{\zeta \in \mathbb{T} : |H(\zeta)| =
1\}$ has Lebesgue measure zero and $1/(m_p - \zeta H n_p) \in L^2$, we define

$$f_H(\zeta) = \frac{1}{|m_p(\zeta)|^2}\, \mathrm{Re}\left[\frac{m_p(\zeta) + \zeta H(\zeta) n_p(\zeta)}{m_p(\zeta) - \zeta H(\zeta) n_p(\zeta)}\right], \quad \zeta \in \mathbb{T}; \qquad (6.2)$$

then

$$\hat f_H(k) = c_k, \quad k = -p, \dots, p.$$

Furthermore, the relation (6.2) establishes a bijection between all the power
spectra that solve the covariance extension problem and the $H \in H^\infty(\mathbb{D})$
verifying that the set $\{\zeta \in \mathbb{T} : |H(\zeta)| = 1\}$ has Lebesgue measure zero and
$1/(m_p - \zeta H n_p) \in L^2$. Finally, the following factorization formula holds:

$$f_H = |F_H|^2 \quad \text{for some } F_H \in H^\infty.$$

Proof: As a consequence of proposition 15.6.1, if statement (ii) is valid then (i)
is true. Assume that (i) holds and let $V_p : \mathcal{D}_p \to \mathcal{R}_p$ be the isometry defined
in lemma 15.2.1. Given $H \in H^\infty(\mathbb{D})$ such that $\|H\|_\infty \le 1$, the set $\{\zeta \in
\mathbb{T} : |H(\zeta)| = 1\}$ has Lebesgue measure zero and $1/(m_p - \zeta H n_p) \in L^2$, let
$U_H \in \mathcal{L}(\mathcal{F}_H)$ be a minimal unitary extension of $V_p$ associated to $H$. Clearly, if
$k = 0, 1, \dots, p$ and $\mu^H$ is the spectral measure of the unitary operator $U_H$ then,
as a consequence of the Spectral Theorem

$$c_k = \overline{c_{-k}} = \langle e_0, e_k\rangle_p = \langle e_0, V_p^k e_0\rangle_p = \langle e_0, U_H^k e_0\rangle_{\mathcal{F}_H} = \frac{1}{2\pi}\int_0^{2\pi} e^{-ikt}\, d\mu_H(t),$$

where $\mu_H(t) = \langle \mu^H(t) e_0, e_0\rangle$. The desired result is a consequence of corollary
15.5.6. The other statements of the theorem can be easily proved. □

The following corollary shows that the set of all solutions of the Covariance 
Extension Problem that we have obtained contains strictly the set of all densi- 
ties in the Wiener class (cf. [Dym, 1988], [Woerdeman, 1989]). The proof is 
very easy. 

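COROLLARY 15.6.3 Given $H \in H^\infty(\mathbb{D})$ such that $\|H\|_\infty \le 1$, the set
$\{\zeta \in \mathbb{T} : |H(\zeta)| = 1\}$ has Lebesgue measure zero and $1/(m_p - \zeta H n_p) \in L^2$,
define

$$f_H(\zeta) = \frac{1}{|m_p(\zeta)|^2}\, \mathrm{Re}\left[\frac{m_p(\zeta) + \zeta H(\zeta) n_p(\zeta)}{m_p(\zeta) - \zeta H(\zeta) n_p(\zeta)}\right], \quad \zeta \in \mathbb{T}.$$

Then, $f_H \in W$ if, and only if,

$$\mathrm{Re}\, \frac{\zeta n_p(\zeta) H(\zeta)}{m_p(\zeta) - \zeta n_p(\zeta) H(\zeta)} \in W.$$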

In 1993, Gabardo (cf. [Gabardo, 1993]) defined the function

$$W_a(e^{i\theta}) = \frac{(1 - |a|^2)\, \|E_a^p\|^2}{|1 - a e^{-i\theta}|^2\, |E_a^p(e^{i\theta})|^2},$$

where $a \in \mathbb{D}$ and $E_a^p$ is defined as in proposition 15.4.1. He proved that
$\hat W_a(k) = c_k$, $k = 0, 1, \dots, p$. Furthermore, he showed that when $a = 0$
the function $W_a$ maximizes the Burg maximum entropy functional. The
following corollary shows that the function $W_a$ can be obtained from (6.2) for
some $H$.

COROLLARY 15.6.4 Given $a \in \mathbb{D}$, the function $W_a$ can be obtained from
(6.2), in the particular case when $H$ is the constant function

$$H_a(z) = \frac{\bar a\, \overline{n_p(a)}}{\overline{m_p(a)}}.$$

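Proof: Let $a \in \mathbb{D}$. From proposition 15.4.1 and the Christoffel-Darboux
formula we obtain that

$$\|E_a^p\|^2 = \langle E_a^p, E_a^p\rangle_p = E_a^p(a) = \frac{|m_p(a)|^2 - |a|^2 |n_p(a)|^2}{1 - |a|^2}$$

and

$$W_a(e^{i\theta}) = \frac{(1 - |a|^2)\, \dfrac{|m_p(a)|^2 - |a|^2 |n_p(a)|^2}{1 - |a|^2}}{|1 - a e^{-i\theta}|^2 \left|\dfrac{\overline{m_p(a)}\, m_p(e^{i\theta}) - e^{i\theta} \bar a\, \overline{n_p(a)}\, n_p(e^{i\theta})}{1 - \bar a e^{i\theta}}\right|^2} = \frac{1 - \dfrac{|a\, n_p(a)|^2}{|m_p(a)|^2}}{\left|m_p(e^{i\theta}) - e^{i\theta} H_a\, n_p(e^{i\theta})\right|^2}.$$

□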
If we assume that $1 = c_0, \dots, c_p$ are the correlations of a second order
stationary process $X = \{X_k\}_{k \in \mathbb{Z}}$, then

$$\sum_{j=0}^{p}\sum_{k=0}^{p} \bar a_j a_k c_{k-j} = \sum_{j=0}^{p}\sum_{k=0}^{p} \bar a_j a_k E(\overline{X_j} X_k) = E\Big|\sum_{k=0}^{p} a_k X_k\Big|^2,$$

that is, the sequence $c_0 = 1, \dots, c_p$ is strictly positive definite. As a
consequence of the previous theorem there exists an integrable function $f$ such
that

$$c_k = \frac{1}{2\pi}\int_0^{2\pi} e^{-ikt} f(e^{it})\, dt, \quad k = -p, \dots, p. \qquad (6.4)$$

In this case, $f$ is called the spectrum of the process $X$. According to (6.4) it is
immediate that the application $X_{-j} \to e_j$, $j \in \mathbb{Z}$, establishes a unitary
isomorphism between $H_X = \mathrm{Span}\{X_j : j \in \mathbb{Z}\}$ and $L^2(f\,dt)$. Whence,

$$\langle e_j, e_k\rangle_p = c_{k-j} = \langle X_k, X_j\rangle_{H_X} = \langle X_{-j}, X_{-k}\rangle_{H_X} = \langle e_j, e_k\rangle_{L^2(f\,dt)}. \qquad (6.5)$$

Let $l, k \in \mathbb{N}$, $k > l$ and $H_{l,k} := \mathrm{Span}\{X_{-n}\}_{n=l}^{k}$ be a subspace of $H_X$. The
innovations are defined by

£p = -^-p ~ PHi, p - 1 X- p , £ p ~Xo — Pff lp _ 1 Xo, 

and they verify on account of (6.5) it is readily obtained that 



= (V p np~i,m p -i)p = 7 P . (6.6) 

From the last equality it is clear why the $\gamma_p$ are called partial autocorrelation
coefficients when they are as in formula (3.2), known as Levinson's algorithm. We
set $\sigma_p^2 := \|\varepsilon_p\|^2_{H_X}$; then, using formula (3.3), it follows that
$\sigma_p^2 = \sigma_{p-1}^2 (1 - |\gamma_p|^2)$.
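The variance recursion $\sigma_p^2 = \sigma_{p-1}^2(1 - |\gamma_p|^2)$ is the core of the classical Levinson-Durbin recursion. The following sketch implements that textbook recursion for a real-valued autocorrelation sequence. It is an illustration under stated assumptions: the function name is hypothetical, the data are taken real for simplicity, and the sign convention of the reflection coefficients here may differ from that of the $\gamma_p$ in the text.

```python
def levinson_durbin(c):
    """Levinson-Durbin recursion for real autocorrelations c = [c_0, ..., c_p].

    Returns (a, gammas, sigma2) where a is the prediction-error filter with a[0] = 1,
    gammas are the reflection (partial autocorrelation) coefficients, and
    sigma2[k] is the prediction error variance at order k.
    """
    p = len(c) - 1
    a = [1.0]
    sigma2 = [float(c[0])]
    gammas = []
    for k in range(1, p + 1):
        # Correlation between the current residual and the new lag.
        acc = sum(a[j] * c[k - j] for j in range(k))
        gamma = -acc / sigma2[-1]
        ext = a + [0.0]
        # Order update: a_new[j] = a[j] + gamma * a[k - j].
        a = [ext[j] + gamma * ext[k - j] for j in range(k + 1)]
        gammas.append(gamma)
        # Variance update, the analogue of sigma_p^2 = sigma_{p-1}^2 (1 - |gamma_p|^2).
        sigma2.append(sigma2[-1] * (1.0 - gamma * gamma))
    return a, gammas, sigma2


if __name__ == "__main__":
    a, gammas, sigma2 = levinson_durbin([1.0, 0.5, 0.4])
    print(a, gammas, sigma2)
```

For the geometric sequence $c_k = \rho^k$ only the first reflection coefficient is nonzero, which corresponds to an AR(1) process.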

15.7 Burg's Entropy 

In this section we use the functional model of Arov-Grossman (cf. [Arov, 
1983]) to find the density of a second order stationary process that solves the 
maximum entropy Burg's problem (cf. [Burg, 1975]). 

The next theorem gives the solution of the main problem stated in the intro- 
duction of this paper. 

THEOREM 15.7.1 Let $p \in \mathbb{N}$ and $\{c_k\}_{k=0}^{p}$ be the first $(p + 1)$
autocorrelations of a second order stationary process $X = \{X_k\}_{k \in \mathbb{Z}}$; then the density $f_0$
of $X$ which maximizes Burg's functional $\varepsilon(f)$ restricted to the conditions

$$\frac{1}{2\pi}\int_0^{2\pi} e^{-ikt} f(e^{it})\, dt = c_k, \quad k = 0, \dots, p,$$

is

$$f_0(e^{it}) = \frac{1}{|n_p(e^{it})|^2}, \quad t \in [0, 2\pi].$$

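Proof: Let $p \in \mathbb{N}$ and $\{c_k\}_{k=0}^{p}$ be the first $(p + 1)$ autocorrelations of a
second order stationary process $X = \{X_k\}_{k \in \mathbb{Z}}$. We use theorem 15.6.2 to
conclude that there exists a measure $\mu$, absolutely continuous with respect to
the Lebesgue measure on $\mathbb{T}$, that satisfies the conditions

$$\frac{1}{2\pi}\int_0^{2\pi} e^{-ikt}\, d\mu(t) = c_k, \quad k = -p, \dots, p.$$

Then it has a density $\frac{d\mu}{dt} = f_H$, where the density $f_H$ is the one stated in the last
theorem and $H \in H^\infty(\mathbb{D})$ is such that $\|H\|_\infty \le 1$, the set $\{\zeta \in \mathbb{T} : |H(\zeta)| = 1\}$
has Lebesgue measure zero and $1/(m_p - \zeta H n_p) \in L^2$. Therefore, if there
exists a maximum of $\varepsilon$ it has to be of the form $\varepsilon(f_H)$ with $H$ verifying the previous
conditions. Thus, we have that

$$f_H(\zeta) = \frac{1}{|m_p(\zeta)|^2}\, \mathrm{Re}\left[\frac{m_p(\zeta) + \zeta H(\zeta) n_p(\zeta)}{m_p(\zeta) - \zeta H(\zeta) n_p(\zeta)}\right],$$

therefore

$$\frac{1}{2\pi}\int_0^{2\pi} \log f_H(e^{it})\, dt = \frac{1}{2\pi}\int_0^{2\pi} \left(\log \mathrm{Re}\left[\frac{m_p(e^{it}) + e^{it} H(e^{it}) n_p(e^{it})}{m_p(e^{it}) - e^{it} H(e^{it}) n_p(e^{it})}\right] - \log |m_p(e^{it})|^2\right) dt$$

$$= \frac{1}{2\pi}\int_0^{2\pi} \log \mathrm{Re}\left[\frac{m_p(e^{it}) + e^{it} H(e^{it}) n_p(e^{it})}{m_p(e^{it}) - e^{it} H(e^{it}) n_p(e^{it})}\right] dt + \frac{1}{2\pi}\int_0^{2\pi} \log f_0(e^{it})\, dt.$$

From Jensen's inequality and Cauchy's formula we obtain

$$\frac{1}{2\pi}\int_0^{2\pi} \log \mathrm{Re}\left[\frac{m_p(e^{it}) + e^{it} H(e^{it}) n_p(e^{it})}{m_p(e^{it}) - e^{it} H(e^{it}) n_p(e^{it})}\right] dt \le \log \mathrm{Re}\, \frac{1}{2\pi}\int_0^{2\pi} \frac{m_p(e^{it}) + e^{it} H(e^{it}) n_p(e^{it})}{m_p(e^{it}) - e^{it} H(e^{it}) n_p(e^{it})}\, dt$$

$$= \log \mathrm{Re}\, \frac{1}{2\pi i}\oint_{\mathbb{T}} \frac{m_p(z) + zH(z) n_p(z)}{m_p(z) - zH(z) n_p(z)}\, \frac{dz}{z} = 0.$$

Hence $\varepsilon(f_H) \le \varepsilon(f_0)$, and the bound is attained for $H = 0$. □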
REMARK 3 An entropy functional different from Burg's was used by Gabardo
(cf. [Gabardo, 1993]). He proves that if $\mu = \mu_f + \mu_s$ ($\mu_s$ is a singular
measure) satisfies (1.1), then



^r iog[/ " (e " ,i iT^ j ¥* ^r iog[Myefl)| F^v 



Another way to characterize the solution of the maximum entropy problem
is the one given by Arocena (cf. [Arocena, 1990], [Arocena, 1990A]). We
obtain such a result as an easy consequence of the fact that if $U \in \mathcal{L}(\mathcal{F})$ is the
minimal unitary extension of the isometry $V_p$ associated to $H = 0$, then

$$U^k \mathcal{M}_p \perp \mathcal{E}_p, \quad k \in \mathbb{N}. \qquad (7.1)$$

Therefore if $\mu(t) = \langle \mu(t) e_0, e_0\rangle$, where $\mu(t)$ is the spectral measure of the
unitary operator $U$, then

$$d\mu(t) = \frac{dt}{|m_p(e^{it})|^2}.$$

Conversely, if (7.1) is true for a minimal unitary extension $U$ of $V_p$ then $U$
corresponds to $H = 0$.

We know from [Azencott, 1986] that the application $X_{-j} \to e_j$, $j \in \mathbb{Z}$, is
a unitary isomorphism from $H_X = \mathrm{Span}\{X_j : j \in \mathbb{Z}\}$ to $L^2(\mathbb{T}, \mu_X)$, where
$\mu_X(t)$ is the spectral measure of the process $X$. It is a known result that if
$d\mu_X(t) = f_0(e^{it})\,dt$, then there exists an autoregressive process AR(p) given
by

$$a_0 X_0 + \cdots + a_p X_p = \varepsilon_p$$

with $\{\varepsilon_n\}_{n \in \mathbb{Z}}$ a white noise. The latter shows that the maximum entropy
solution, which is obtained when $H = 0$, has the form $f_0(e^{it}) = \frac{1}{|n_p(e^{it})|^2}$, and
this is the spectral density of an autoregressive process of order $p$, AR(p).
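The statement that the maximum entropy density is an AR(p) spectral density can be checked numerically. The sketch below fits an AR model to real autocorrelations $c_0, \dots, c_p$ with the textbook Levinson-Durbin recursion, forms the density $\sigma^2 / |A(e^{it})|^2$, and recovers its Fourier coefficients by a Riemann sum. The function names, the real-valued data, and the grid size are assumptions of this example; the identification of $\sigma^2/|A|^2$ with $1/|n_p|^2$ is taken from the discussion above rather than verified here.

```python
import cmath
import math


def fit_ar(c):
    """Levinson-Durbin for real c = [c_0, ..., c_p]; returns (a, sigma2) with a[0] = 1."""
    a = [1.0]
    sigma2 = float(c[0])
    for k in range(1, len(c)):
        acc = sum(a[j] * c[k - j] for j in range(k))
        gamma = -acc / sigma2
        ext = a + [0.0]
        a = [ext[j] + gamma * ext[k - j] for j in range(k + 1)]
        sigma2 *= 1.0 - gamma * gamma
    return a, sigma2


def ar_density(a, sigma2, t):
    """AR spectral density sigma2 / |sum_j a_j e^{-ijt}|^2 at angle t."""
    transfer = sum(a_j * cmath.exp(-1j * j * t) for j, a_j in enumerate(a))
    return sigma2 / abs(transfer) ** 2


def fourier_coefficient(c, k, n_grid=4096):
    """(1/2pi) * integral of e^{-ikt} f(t) dt, approximated on a uniform grid."""
    a, sigma2 = fit_ar(c)
    total = 0j
    for i in range(n_grid):
        t = 2.0 * math.pi * i / n_grid
        total += cmath.exp(-1j * k * t) * ar_density(a, sigma2, t)
    return total / n_grid


if __name__ == "__main__":
    c = [1.0, 0.5, 0.4]
    for k in range(len(c)):
        print(k, fourier_coefficient(c, k).real)  # should reproduce c[k]
```

The coefficients beyond lag $p$ are not prescribed; the AR model extends them by its own recursion.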
Abstract We survey recent results in geometric analysis which explicitly involve both the
geometry of Riemannian manifolds and probability. We include developments in
spectral geometry, the study of isoperimetric phenomena, comparison geometry, 
minimal varieties, harmonic functions, and Hodge theory. 

Keywords: Spectral geometry, isoperimetric conditions, comparison theorems, minimal va- 
rieties, harmonic functions, Hodge theory 

16.1 Introduction 

The first task of a survey concerning results in geometric analysis is to limit 
the scope of the project by creating a theme which provides a focus and is of 
interest to a reasonably large audience. The theme which runs throughout this 
paper can be concisely stated: the material reviewed in this survey explicitly 
involves both the geometry of (finite dimensional) Riemannian manifolds and 
probability. 

The second task of a survey concerning results which bridge a number of 
topics is to choose a perspective from which to work. We choose to treat 
geometric phenomena as primary in our organization of the material. Thus, the 
paper is broken up into sections, each of which focusses on a specific category 
of geometric problems. Inside each of these categories we discuss a variety of 
related probabilistic results. 

It is now common knowledge that there are a number of important con- 
structions which tie together analysis, probability and geometry. For example, 
associated to a Riemannian manifold there is a natural differential operator 
(the Laplace-Beltrami operator), which is defined in terms of the underlying 
geometry of the manifold, and in turn serves as the infinitesimal generator for 
the natural diffusion process on the manifold (Brownian motion). Because the 
Laplace operator is closely related to the metric, solutions of the fundamental 
partial differential equations and boundary value problems (Dirichlet problem,
heat equation, etc) and the associated constructions (spectrum, eigenfunctions, 
etc) contain a great deal of geometric information related to the underlying 
manifold. Because it is possible to use the path properties of Brownian mo- 
tion to give probabilistic representations of the objects constructed to study 
the fundamental boundary value problems, there is hope that the techniques of 
modern probability can be brought to bear on questions involving the geometry 
of the underlying manifold. History bears this out; the results which follow are 
part of this record. 

There are many connections between analysis, probability and geometry 
in addition to those described above. All of these connections are united by 
a common thread: The metric gives rise to objects belonging to each of the 
three categories (eg, the Laplace-Beltrami operator, Brownian motion, the Rie- 
mann curvature tensor). One moves between categories by constructing iden- 
tities/inequalities in one category using the objects of another. We have orga- 
nized the material to reflect this fundamental logic. More precisely, in each of 
the sections that follow, we define a collection of geometric/analytic problems 
by reference to a Riemannian metric. Citing relationships between the prob- 
lems of a given section and modern probability (relationships usually afforded 
by the metric), we sketch results which occur as corollaries, with implications 
in both directions. 

Given that all results depend on familiarity with the basic construction in 
each of the categories, we include a short exposition of the material common 
to all topics. It is hoped that in addition to fixing notation, this exposition 
makes the paper relatively self-contained. Given that this is a survey, proofs 
are for the most part omitted, with appropriate references sufficing. 

The paper is organized as follows. In section 2 we establish notation that 
will be used throughout the paper while reviewing the background material 
in analysis, probability, and geometry. In section 3 we study the geometry of 
balls and tubes in Riemannian manifolds. Much of section 3 revolves around 
the study of the asymptotics of exit time moments of Brownian motion, although
we also review results involving cover times and principal curves. In
section 4 we review results related to spectral geometry. While much of sec- 
tion 4 is related to the relationship between Dirichlet spectrum and various 
norms of exit time moments of Brownian motion, we also review material in- 
volving coupling techniques and estimates for a variety of problems involving 
a spectral gap. In section 5 we focus on topics related to isoperimetric phe- 
nomena and comparison geometry. Again, we study results involving exit time 
moments for Brownian motion, as well as comparison phenomena involving
transience/recurrence of Brownian motion. In section 6 we study minimal varieties
(i.e., varieties which arise as solutions to geometric variational problems). In
section 7 we review material involving harmonic functions. Much of this section
is devoted to results involving the study of Martin boundaries and natural ex- 
tensions to the theory of harmonic maps. Finally, in section 8 we review work 
involving Hodge theory. 

Because we have chosen to limit the scope and organize the material as 
sketched above, we do not include many results which could certainly be 
counted as explicitly involving modern probability and geometric analysis. In 
particular, we have not included material involving the largely parallel theory 
of random walks on graphs, nor have we included results which involve the
(infinite dimensional) geometry of path spaces. We have not reviewed results
using the Malliavin calculus, nor have we included material which involves
processes on Euclidean domains when that material does not clearly indicate that
there is an underlying geometric phenomenon being studied. Most regrettably,
we have not included material involving index theory where Bismut's proba- 
bilistic techniques have led to important results for both the geometry of Rie- 
mannian manifolds and the geometry of their loop spaces (for those interested 
in this material, see the survey [Bismut, 1986], the article [Jones, 1997] and 
references therein). 

As is clear from the outline of the paper, one could devote several volumes to 
any one of the topics we survey (and others have). This survey is not intended 
as a comprehensive review of any of the topics, let alone all of the topics. 
Rather, we have attempted to provide enough information on each topic to give 
the reader a feel for new results in the context of specific developmental trends. 
Given limitations of space and time, decisions concerning what material to 
include must be made. Given imperfect knowledge, there are bound to be, in 
addition to the choices dictated by our choice of focus and obvious constraints, 
a number of unintentional sins of omission for which we apologize in advance.

16.2 Notation and Background Material 

Throughout this paper, $M$ will denote a smooth $n$-dimensional manifold
with Riemannian structure $g$. We will write $C^\infty(M)$ for the space of smooth
functions on $M$. We will denote by $TM$ the tangent bundle of $M$. As a point
set,

$$TM = \bigsqcup_{x \in M} T_xM$$

where $T_xM$ is the space of tangent vectors to $M$ at $x$, a vector space of
dimension $n$. There is a natural (projection) map $\pi : TM \to M$ which associates to
a tangent vector $V \in T_xM$ the point at which it is a tangent vector: $\pi(V) = x$.
The space $TM$ carries a natural smooth structure for which the projection map
is smooth; it is a manifold of dimension $2n$, a vector bundle over $M$ with
fiber at $x \in M$ the vector space $T_xM$. Smooth sections of the bundle $TM$ are
smooth maps $s : M \to TM$ which satisfy $\pi(s(x)) = x$. A smooth section of
the tangent bundle is just a smooth vector field on the manifold $M$.

If $X = (X_1, X_2, \dots, X_n)$ gives local coordinates near a point $x \in M$, the
tangent space at $x$ is spanned by $\{\partial/\partial X_i\}_{i=1}^{n}$ and the cotangent space at $x$,
denoted $T_x^*M$, is the dual space to $T_xM$ and is spanned by $\{dX_i\}_{i=1}^{n}$ (the
collection of objects dual to $\{\partial/\partial X_i\}_{i=1}^{n}$). We denote the cotangent bundle by $T^*M$; it
is constructed as was the tangent bundle, as a disjoint union of vector spaces:

$$T^*M = \bigsqcup_{x \in M} T_x^*M.$$

The cotangent bundle also carries a natural smooth structure; it is a manifold
of dimension $2n$.

For $k \le n$, we will denote by $\Lambda^k T_x^*M$ the $k$th exterior power of $T_x^*M$. If $I$
is a $k$-multi-index, $I = (i_1, i_2, \dots, i_k)$, and $dX_I = dX_{i_1} \wedge dX_{i_2} \wedge \cdots \wedge dX_{i_k}$,
then $\{dX_I : I \text{ increasing}\}$ is a basis of $\Lambda^k T_x^*M$. As in the construction of the
tangent bundle, we can endow the disjoint union

$$\Lambda^k M = \bigsqcup_{x \in M} \Lambda^k T_x^*M$$

with a smooth structure, making it a vector bundle of dimension $\binom{n}{k}$. We denote by
$\Omega^k = C^\infty(M, \Lambda^k)$ the smooth sections of the bundle of $k$th exterior powers
(the $k$-forms on $M$).
Those interested in the details should consult any one of the many references
to this material, e.g. [Dubrovin, 1984].
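The basis $\{dX_I : I \text{ increasing}\}$ can be enumerated directly, which also makes the fiber dimension $\binom{n}{k}$ concrete. The short sketch below lists the increasing multi-indices; the function name is hypothetical and the indices are 1-based to match the coordinates $X_1, \dots, X_n$.

```python
from itertools import combinations
from math import comb


def increasing_multi_indices(n: int, k: int) -> list:
    """All increasing k-multi-indices I = (i_1 < ... < i_k) with entries in 1..n."""
    return list(combinations(range(1, n + 1), k))


if __name__ == "__main__":
    indices = increasing_multi_indices(4, 2)
    print(indices)                      # labels of the basis 2-forms dX_I on a 4-manifold
    print(len(indices) == comb(4, 2))   # the count equals the binomial coefficient
```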

Given a point $x \in M$, the Riemannian metric $g$ is a nondegenerate quadratic
form on the space of tangent vectors at $x$ which varies smoothly in $x$. We will
often write the metric as $(g_{ij})$, by which we intend to communicate that it can
be viewed locally as an $n \times n$ matrix relative to a choice of local coordinates.

Given two Riemannian manifolds $(M, g)$ and $(N, h)$, and a smooth map
$\phi : M \to N$, we will denote by $D\phi$ the induced map (derivative) on tangent
spaces: $D\phi : T_xM \to T_{\phi(x)}N$. We say that $M$ and $N$ are isometric if there is
a diffeomorphism $\phi : M \to N$ satisfying $g = \phi^*h$, where $\phi^*$ is the pullback
operation:

$$g_x(V_1, V_2) = h_{\phi(x)}(D\phi(V_1), D\phi(V_2)).$$

We say that $M$ and $N$ are locally isometric if at each point we can find
neighborhoods of $M$ and $N$ which are isometric. We say that a Riemannian manifold
is locally flat if it is locally isometric to $\mathbb{R}^n$.

Given a function $f \in C^\infty(M)$ and local coordinates as above, we can
define a 1-form by the local formula $df = \sum_i \frac{\partial f}{\partial X_i}\, dX_i$. This map, the exterior
derivative on functions, is defined similarly on all form bundles and denoted
by $d : \Omega^k \to \Omega^{k+1}$. The adjoint map (defined via the metric) will be denoted
by $d^* : \Omega^{k+1} \to \Omega^k$. The Laplace-Beltrami operator is then invariantly defined
by

$$\Delta^{(k)} = dd^* + d^*d : \Omega^k \to \Omega^k. \qquad (2.1)$$

When the form dimension is understood, we will denote the Laplace-Beltrami
operator by $\Delta$.

Acting on functions, with local coordinates as above, the Laplace operator is given in terms of the metric by

$$\Delta = \frac{1}{\sqrt{g}} \sum_{i,j} \frac{\partial}{\partial X_i}\left(\sqrt{g}\, g^{ij} \frac{\partial}{\partial X_j}\right) \tag{2.2}$$

where $g$ is the determinant of the metric and $g^{ij}$ is the $ij$th entry of the inverse of the matrix $(g_{ij})$ of the Riemannian metric. There is a similar form for the Laplace-Beltrami operator on forms (locally, the Laplace-Beltrami operator on forms is given as a system).
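
As a quick check of the local formula for the Laplace operator on functions, one can work it out for the Euclidean plane in polar coordinates. The computation below is an illustrative sketch; it assumes the sign convention in which the operator acts on functions as the divergence of the gradient (with the convention $dd^* + d^*d$ the overall sign is reversed).

```latex
% Polar coordinates (X_1, X_2) = (r, \theta) on R^2 minus the origin.
% Metric: (g_{ij}) = diag(1, r^2), so g = det(g_{ij}) = r^2,
% \sqrt{g} = r and (g^{ij}) = diag(1, r^{-2}).
\begin{align*}
\Delta f
  &= \frac{1}{\sqrt{g}} \sum_{i,j} \frac{\partial}{\partial X_i}
     \Big( \sqrt{g}\, g^{ij}\, \frac{\partial f}{\partial X_j} \Big) \\
  &= \frac{1}{r} \frac{\partial}{\partial r}\Big( r\, \frac{\partial f}{\partial r} \Big)
     + \frac{1}{r^2} \frac{\partial^2 f}{\partial \theta^2}.
\end{align*}
```

This is the familiar polar form of the flat Laplacian, as expected since the metric is Euclidean.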

The metric on $M$ induces a volume form, denoted $dg$, which in turn induces a pairing on the space of compactly supported $k$-forms. Let $L^2(\Omega^k, dg)$ denote the $L^2$-completion of the compactly supported $k$-forms with respect to $dg$. When $M$ is compact, the Laplace-Beltrami operator is essentially self-adjoint and thus admits a unique self-adjoint extension to $L^2(\Omega^k, dg)$. When $M$ is not compact, the situation is more complicated. For those interested in the general details, the reference [Reed, 1978] provides the requisite functional analysis.

Letting $\Delta$ act on the space of compactly supported smooth functions on $M$, denoted $C_0^\infty(M)$, we will denote by $p_M(t,x,y)$ the heat kernel on $(0, \infty) \times M \times M$. We recall that $p_M(t,x,y)$ is the smallest positive solution of the initial value problem

$$\begin{aligned}
\partial_t p_M - \tfrac{1}{2}\Delta p_M &= 0 \quad \text{on } (0,\infty) \times M \times M \\
\lim_{t \to 0^+} p_M(t, \cdot, y) &= \delta_y(\cdot)
\end{aligned} \tag{2.3}$$

where $\delta_y$ is the Dirac distribution with mass at $y$.

As is well known, Brownian motion is the diffusion process with transition densities given by $p_M$. We will denote by $\mathbb{P}^x$, $x \in M$, the probability measure weighting Brownian paths beginning at $x$ and by $\mathbb{E}^x$ the corresponding expectation operators. We denote by $P_t = e^{-t\Delta}$ the operator semigroup acting on continuous functions on $M$:

$$P_t f(x) = \int_M p_M(t,x,y) f(y)\, dg(y)$$

where $dg$ is the metric density. We note that $P_t f(x)$ gives the solution to the Cauchy problem:

$$\begin{aligned}
\partial_t u - \tfrac{1}{2}\Delta u &= 0 \quad \text{on } (0,\infty) \times M \\
\lim_{t \to 0^+} u(x,t) &= f(x) \quad \text{on } M.
\end{aligned} \tag{2.4}$$

Analogous remarks hold in the case of $k$-forms for the corresponding operator semigroup.

Given a domain $D \subset M$ with sufficient boundary regularity, we can construct the heat kernel associated to $D$, denoted $p_D(t,x,y)$, and an associated Brownian motion on $D$ (Brownian motion absorbed at the boundary). Following Kakutani, we can use properties of Brownian motion to solve the fundamental boundary value problems associated to $D$. More precisely, let $X_t$ be Brownian motion on $M$ and let $\tau$ be the first exit time of $X_t$ from $D$:

$$\tau = \inf\{t > 0 : X_t \notin D\}. \tag{2.5}$$

If $f \in C^\infty(\partial D)$ and $g \in C^\infty(D)$, then the solution of the Dirichlet problem

$$\begin{aligned}
\tfrac{1}{2}\Delta u &= 0 \quad \text{on } D \\
u(x) &= f(x) \quad \text{on } \partial D
\end{aligned} \tag{2.6}$$

is given by

$$u(x) = \mathbb{E}^x[f(X_\tau)] \tag{2.7}$$

while the solution of the Poisson problem

$$\begin{aligned}
\tfrac{1}{2}\Delta u &= g \quad \text{on } D \\
u(x) &= 0 \quad \text{on } \partial D
\end{aligned} \tag{2.8}$$

is given by

$$u(x) = -\mathbb{E}^x\left[\int_0^\tau g(X_t)\, dt\right]. \tag{2.9}$$

More generally, if $c(x)$ is sufficiently regular, the solution of

$$\begin{aligned}
\tfrac{1}{2}\Delta u - cu &= g \quad \text{on } D \\
u &= f \quad \text{on } \partial D
\end{aligned} \tag{2.10}$$

is given by the Feynman-Kac formula:

$$u(x) = -\mathbb{E}^x\left[\int_0^\tau g(X_t) \exp\left\{-\int_0^t c(X_s)\, ds\right\} dt\right] + \mathbb{E}^x\left[f(X_\tau) \exp\left\{-\int_0^\tau c(X_s)\, ds\right\}\right]. \tag{2.11}$$

There are, of course, similar formulae for the solution of boundary value problems involving the heat operator.

By choosing $g(x) = -1$ in (2.8) and (2.9) we obtain

$$\mathbb{E}^x[\tau] = u(x). \tag{2.12}$$

There are similar expressions for the higher moments given by recursive solution of the Poisson problems: writing $u_1(x) = u(x)$ for $u$ as in (2.12), let $u_n(x)$ be the solution of

$$\begin{aligned}
\tfrac{1}{2}\Delta u_n &= -n u_{n-1} \quad \text{on } D \\
u_n(x) &= 0 \quad \text{on } \partial D.
\end{aligned} \tag{2.13}$$

Then, as in [Kinateder, 1998] and [McDonald, 2002],

$$\mathbb{E}^x[\tau^n] = u_n(x). \tag{2.14}$$
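
The recursive Poisson problems for the exit time moments are easy to try out numerically. The sketch below is an illustration only: it assumes the one-dimensional domain $D = (-1, 1)$ with generator $\frac{1}{2}\frac{d^2}{dx^2}$, a uniform finite-difference grid, the starting value $u_0 = 1$, and hypothetical helper names (`solve_dirichlet_1d`, `exit_time_moments`) that do not come from the references. For this domain the exact answers are $u_1(x) = 1 - x^2$ and $u_2(x) = x^4/3 - 2x^2 + 5/3$.

```python
def solve_dirichlet_1d(rhs, h):
    """Solve (1/2) u'' = rhs on a uniform grid with u = 0 at both endpoints.

    rhs holds the right-hand side at the interior nodes; h is the grid spacing.
    The centred difference gives -u[i-1] + 2u[i] - u[i+1] = -2 h^2 rhs[i],
    which is solved with the Thomas algorithm for tridiagonal systems.
    """
    n = len(rhs)
    sub = [-1.0] * n      # coefficients of u[i-1]
    diag = [2.0] * n      # coefficients of u[i]
    sup = [-1.0] * n      # coefficients of u[i+1]
    d = [-2.0 * h * h * r for r in rhs]
    for i in range(1, n):                 # forward elimination
        w = sub[i] / diag[i - 1]
        diag[i] -= w * sup[i - 1]
        d[i] -= w * d[i - 1]
    u = [0.0] * n
    u[-1] = d[-1] / diag[-1]
    for i in range(n - 2, -1, -1):        # back substitution
        u[i] = (d[i] - sup[i] * u[i + 1]) / diag[i]
    return u


def exit_time_moments(num_moments, n_interior=199):
    """Return [u_1, ..., u_num_moments] on the interior nodes of (-1, 1)."""
    h = 2.0 / (n_interior + 1)
    u_prev = [1.0] * n_interior           # u_0 = 1 (assumed starting value)
    moments = []
    for n in range(1, num_moments + 1):
        u_n = solve_dirichlet_1d([-n * v for v in u_prev], h)
        moments.append(u_n)
        u_prev = u_n
    return moments


u1, u2 = exit_time_moments(2)
centre = 99                               # index of the node x = 0 when n_interior = 199
first_moment_at_0 = u1[centre]            # approximates E^0[tau]
second_moment_at_0 = u2[centre]           # approximates E^0[tau^2]
```

The same loop works for any solver of the Poisson problem; only `solve_dirichlet_1d` is specific to the interval.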



There are closely related parabolic results: consider the special case of (2.4) with $f$ taken to be the constant function 1 on the interior of $D$, 0 on the boundary of $D$, and the boundary held at 0 for all time. With $p_D(t,x,y)$ the heat kernel and $dg$ the volume form, we set

$$u_D(t,x) = \int_D p_D(t,x,y)\, dg(y). \tag{2.15}$$

Then $u_D(t,x)$ is the solution to the initial value problem

$$\begin{aligned}
\frac{\partial u_D}{\partial t} &= \tfrac{1}{2}\Delta u_D \quad \text{on } (0,\infty) \times D \\
u_D(0,x) &= \begin{cases} 1 & \text{if } x \in D \\ 0 & \text{if } x \in \partial D \end{cases} \\
u_D(t,x) &= 0 \quad \text{if } x \in \partial D.
\end{aligned} \tag{2.16}$$

In addition, $u_D$ gives the distribution of the exit time:

$$u_D(t,x) = \mathbb{P}^x(\tau > t). \tag{2.17}$$

These observations provide a well-studied means of moving between PDE and probability.

While properly speaking it is the Riemannian metric which defines the cat- 
egory of Riemannian manifolds, it is the Riemannian curvature tensor (which 
measures the obstruction to the Riemannian manifold being locally isometric 
to Euclidean space), and the notion of geodesic upon which much interest is 
focused. Both of these objects are most easily described using the language of 
connections. We recall the basic facts: 

A connection on a manifold $M$ is a differential operator

$$\nabla : C^\infty(M,TM) \times C^\infty(M,TM) \to C^\infty(M,TM)$$

which for any $f \in C^\infty(M)$ satisfies

$$\nabla_{Y_2}(fY_1) = f\nabla_{Y_2}Y_1 + Y_2(f)Y_1. \tag{2.18}$$

A connection which satisfies

$$\nabla_{Y_1}Y_2 - \nabla_{Y_2}Y_1 = [Y_1, Y_2] \tag{2.19}$$

is said to be torsion free. Given a Riemannian metric $g$, a straightforward computation establishes that there always exists a unique torsion free connection, $\nabla : C^\infty(M,TM) \times C^\infty(M,TM) \to C^\infty(M,TM)$, compatible with the metric in the sense that

$$Y_3\langle Y_1, Y_2\rangle = \langle \nabla_{Y_3}Y_1, Y_2\rangle + \langle Y_1, \nabla_{Y_3}Y_2\rangle \tag{2.20}$$

where the pairing is defined by the metric. The torsion free connection satisfying (2.20) is called the Levi-Civita connection. The Levi-Civita connection defines, for $Y_1, Y_2 \in C^\infty(M,TM)$, a curvature operator $R(Y_1, Y_2) : C^\infty(M,TM) \to C^\infty(M,TM)$:

$$R(Y_1, Y_2) = \nabla_{Y_1}\nabla_{Y_2} - \nabla_{Y_2}\nabla_{Y_1} - \nabla_{[Y_1,Y_2]}. \tag{2.21}$$

From (2.21) it is clear that $R(Y_1, Y_2) = -R(Y_2, Y_1)$ and thus the curvature operator is a tensor that takes values in the skew-symmetric endomorphisms of the tangent bundle. The curvature operator defines the Riemann curvature tensor $R_{jklm}$ whose components relative to a basis $\{Y_i\}$ of the tangent space $T_xM$ are given by

$$R_{jklm} = \langle R(Y_j, Y_k)Y_l, Y_m\rangle \tag{2.22}$$

where once again the pairing is given by the metric.

We can use the Levi-Civita connection to express the Laplacian on $k$-forms (the Weitzenbock decomposition):

$$\Delta^{(k)} = \nabla^*\nabla + \mathcal{R}^k \tag{2.23}$$

where $\mathcal{R}^k$, the Weitzenbock curvature term, is given by certain components of the Riemann curvature tensor (for $k = 1$, $\mathcal{R}^1 = \mathrm{Ric}$, the Ricci curvature (2.24)). Such a decomposition was exploited by Bochner to relate the structure of the space of harmonic forms and the underlying geometry and topology of the manifold (cf [Goldberg, 1962] for a variety of examples).

Taking appropriate contractions of the Riemann curvature tensor, we obtain well-studied invariants of the Riemannian metric. For example, relative to an orthonormal basis $\{Y_i\}$ of $T_xM$, the Ricci tensor is the symmetric bilinear form defined by

$$\mathrm{Ric}(Y_j, Y_l) = \sum_k R_{jklk} \tag{2.24}$$

while the scalar curvature is defined by

$$S = \sum_k \mathrm{Ric}(Y_k, Y_k). \tag{2.25}$$

The sectional curvature associated to a two-plane in $T_xM$ is given by choosing a spanning set for the two-plane, say $\{Y_j, Y_k\}$, and defining

$$K(Y_j, Y_k) = \frac{\langle R(Y_j, Y_k)Y_j, Y_k\rangle}{|Y_j|^2 |Y_k|^2 - \langle Y_j, Y_k\rangle^2}. \tag{2.26}$$

Sectional curvature generalizes the notion of Gauss curvature for a surface in three space, and one can recover the Riemann curvature tensor from knowledge of all the corresponding sectional curvatures. The relationship of sectional curvatures to the Ricci curvature is particularly useful: suppose that $V \in T_xM$ is a unit vector and suppose that $\{Y_i\}_{i=1}^n$ is an orthonormal basis of $T_xM$ with $Y_n = V$. Then, from (2.24),

$$\mathrm{Ric}(V, V) = \sum_{i=1}^{n-1} K(Y_i, V) \tag{2.27}$$

from which we conclude that, for any unit vector $V$, $\mathrm{Ric}(V,V)/(n-1)$ is the average of the sectional curvatures of the two-planes spanned by $V$ and the vectors of such a basis.

Given two points $x, y \in M$, we denote by $C_{xy}$ the collection of smooth curves $\gamma : [0,1] \to M$ satisfying $\gamma(0) = x$ and $\gamma(1) = y$. In local coordinates we will write $\gamma(t) = (\gamma_1(t), \ldots, \gamma_n(t))$. Denoting the tangent vector to $\gamma$ at $t$ by $\dot\gamma(t)$, the length of $\gamma$ is given by

$$l(\gamma) = \int_0^1 \sqrt{\langle \dot\gamma(t), \dot\gamma(t)\rangle}\, dt$$

where the pairing is given by the metric acting on the tangent space $T_{\gamma(t)}M$. The distance between $x$ and $y$ is defined by

$$\mathrm{dist}(x,y) = \inf_{\gamma \in C_{xy}} l(\gamma). \tag{2.28}$$

Fixing $x$, if $y$ is near $x$, the distance between $x$ and $y$ is realized by a smooth curve $\gamma = \gamma_{xy}$ which minimizes the length function. To obtain $\gamma$ one can compute the Euler-Lagrange equation associated to the length functional. This gives a system of second order ODEs for the components of $\gamma$:

$$\frac{d^2\gamma_k}{dt^2} + \sum_{i,j} \Gamma^k_{ij} \frac{d\gamma_i}{dt} \frac{d\gamma_j}{dt} = 0 \tag{2.29}$$

where the functions $\Gamma^k_{ij}$ define the Christoffel symbols. The Christoffel symbols can be written in terms of the connection and, in turn, the Christoffel symbols give a local expression for the connection (cf [Chavel, 1984]). In particular, the Christoffel symbols can be used to define the Riemann curvature tensor: writing $\partial_i = \partial/\partial X_i$ for the coordinate vector fields, so that $\nabla_{\partial_i}\partial_j = \sum_k \Gamma^k_{ij}\partial_k$, definition (2.21) gives

$$R(\partial_i, \partial_j)\partial_k = \sum_l \Big( \partial_i \Gamma^l_{jk} - \partial_j \Gamma^l_{ik} + \sum_m \big( \Gamma^m_{jk}\Gamma^l_{im} - \Gamma^m_{ik}\Gamma^l_{jm} \big) \Big) \partial_l.$$

Returning to (2.29), if we require that the curve be parameterized by arclength, we note that, for small times, the associated initial value problem has a unique solution. This solution is called a geodesic. We say that a Riemannian manifold is complete if the (small time) solution of the initial value problem for (2.29) does not explode; that is, the solution of (2.29) exists for all $t \in [0, \infty)$.
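
To make the geodesic equation concrete, the following worked example (an illustrative sketch, not taken from the references) computes the Christoffel symbols of the flat plane in polar coordinates and checks that a straight line solves the resulting system.

```latex
% Flat metric in polar coordinates (r, \theta): g = dr^2 + r^2 d\theta^2.
% The only nonzero Christoffel symbols are
\[
\Gamma^{r}_{\theta\theta} = -r, \qquad
\Gamma^{\theta}_{r\theta} = \Gamma^{\theta}_{\theta r} = \frac{1}{r},
\]
% so the geodesic system reads
\begin{align*}
\ddot r - r\,\dot\theta^{2} &= 0, \\
\ddot\theta + \frac{2}{r}\,\dot r\,\dot\theta &= 0.
\end{align*}
% Check: the vertical line (x, y) = (1, t), parameterized by arclength,
% has r(t) = \sqrt{1 + t^2} and \theta(t) = \arctan t, hence
% \dot r = t/r, \ddot r = 1/r^3, \dot\theta = 1/r^2, \ddot\theta = -2t/r^4,
% and both equations hold identically.
```

As expected for a locally flat metric, the geodesics are straight lines, even though the Christoffel symbols do not vanish in these coordinates.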

Given $x \in M$ and $v \in T_xM$, we will denote the geodesic with initial data $(x,v)$ by $\gamma(x,v,t)$. Given $v$ of small norm, the exponential map, $\exp_x : T_xM \to M$, defined by

$$\exp_x(v) = \gamma(x,v,1) \tag{2.30}$$

is a diffeomorphism onto its image. Using the exponential map we obtain an important set of local coordinates (geodesic normal coordinates) defined by

$$(r, \theta) = (|v|, v/|v|) \mapsto \gamma(x,v,1).$$

It is often the case that computations in geodesic normal coordinates facilitate an understanding of both the analysis and the geometry of a given problem. For example, if we fix $x \in M$ and use geodesic normal coordinates near $x$ we can expand components of the Riemannian metric. For $v \in T_xM$ of small norm and $\{Y_j\}$ an orthonormal basis of $T_xM$,

$$g_{jk}(v) = \delta_{jk} - \tfrac{1}{3}\langle R(v, Y_j)v, Y_k\rangle + O(|v|^3) \tag{2.31}$$

where $R$ is the curvature operator (this exhibits the Riemann curvature tensor as the second order obstruction to the metric being locally Euclidean). Similarly, there is an expression for the volume form:

$$\det(g_{jk}(v)) = 1 - \tfrac{1}{3}\mathrm{Ric}(v,v) + O(|v|^3) \tag{2.32}$$

where Ric is the Ricci curvature (this exhibits the Ricci curvature as the second order obstruction to the volume form being locally Euclidean).

16.3 The geometry of small balls and tubes 

Let $H \subset \mathbb{R}^n$ be a compact embedded submanifold of $\mathbb{R}^n$ of dimension $p$ and, for $\varepsilon > 0$, let $T(H, \varepsilon)$ be the tube of radius $\varepsilon$ around $H$:

$$T(H, \varepsilon) = \{y \in \mathbb{R}^n : \mathrm{dist}(y, H) < \varepsilon\} \tag{3.1}$$

where $\mathrm{dist}(y,x)$ is the Euclidean distance between the points $y$ and $x$ and $\mathrm{dist}(y,H) = \inf_{x \in H} \mathrm{dist}(y,x)$. In a remarkable 1939 paper which arose to address a problem in statistics, Hermann Weyl developed a formula for the volume of $T(H, \varepsilon)$ for $\varepsilon$ small:

$$\mathrm{Vol}(T(H, \varepsilon)) = \frac{(\pi\varepsilon^2)^{(n-p)/2}}{\left(\frac{1}{2}(n-p)\right)!} \sum_{j=0}^{[p/2]} \frac{k_{2j}(H)\, \varepsilon^{2j}}{(n-p+2)(n-p+4)\cdots(n-p+2j)} \tag{3.2}$$

where the $k_{2j}$ denote certain curvature invariants of the submanifold $H$. Weyl's formula inspired a great many developments in geometry, statistics and probability (the book [Gray, 1990] is devoted to the topic). In this section we focus on those developments related to probability.
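
Weyl's formula can be checked by hand in the simplest cases. The sketch below is an illustration under stated assumptions: the function name `weyl_tube_volume` is hypothetical, the factorial of a half-integer is evaluated with the gamma function, and the curvature invariants are supplied by the caller. For a circle of radius $R$ in the plane only $k_0 = 2\pi R$ enters; for a round sphere of radius $R$ in three-space the values $k_0 = 4\pi R^2$ and $k_2 = 4\pi$ are the ones that reproduce the elementary volume of a spherical shell.

```python
import math


def weyl_tube_volume(n, p, k, eps):
    """Evaluate the right-hand side of Weyl's tube formula.

    n   : dimension of the ambient Euclidean space
    p   : dimension of the submanifold H
    k   : list [k_0, k_2, ..., k_{2m}] of curvature invariants, m <= p // 2
    eps : tube radius (assumed small enough for the tube to be embedded)
    """
    codim = n - p
    # (pi eps^2)^{(n-p)/2} / ((n-p)/2)!  with x! = Gamma(x + 1)
    prefactor = (math.pi * eps ** 2) ** (codim / 2) / math.gamma(codim / 2 + 1)
    total = 0.0
    denom = 1.0
    for j, k2j in enumerate(k):
        if j > 0:
            denom *= codim + 2 * j        # (n-p+2)(n-p+4)...(n-p+2j)
        total += k2j * eps ** (2 * j) / denom
    return prefactor * total


R, eps = 2.0, 0.1
circle_tube = weyl_tube_volume(2, 1, [2 * math.pi * R], eps)
sphere_tube = weyl_tube_volume(3, 2, [4 * math.pi * R ** 2, 4 * math.pi], eps)
# Elementary values for comparison: an annulus and a spherical shell.
annulus_area = math.pi * ((R + eps) ** 2 - (R - eps) ** 2)
shell_volume = 4 * math.pi / 3 * ((R + eps) ** 3 - (R - eps) ** 3)
```

In both cases the formula is exact for every admissible radius, not just asymptotically, which is the striking feature of Weyl's result.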

We begin by noting that there is an invariant description of the tube around $H$ which can be obtained using the normal bundle of $H$. More precisely, let $(M, g)$ be a Riemannian manifold, $H \subset M$ an embedded compact submanifold of dimension $p$. Let $NH$ be the normal bundle of $H$ in $M$, that is, the bundle over $H$ whose fibre at $x \in H$ is the vector space

$$N_xH = \{v \in T_xM : \langle v, w\rangle = 0 \text{ for all } w \in T_xH\}.$$

Given a point $x \in H$ and a unit vector $v \in N_xH$, the small time solution of the second order ODE for length minimizing curves (see (2.29)) gives a unique geodesic starting at $x$ with tangent vector at $x$ given by $v$. We denote this geodesic by $\gamma(x,v,t)$, where $0 \le t < \varepsilon(x,v)$. Allowing $x$ to vary in $H$ and $v$ to vary in the unit sphere of $N_xH$, we obtain a family of geodesics, all defined up to some time $\varepsilon > 0$. For $\varepsilon$ small enough, the pointset $T(H, \varepsilon)$ defined by

$$T(H, \varepsilon) = \{y \in M : \exists\, x \in H,\ v \in N_xH,\ \varepsilon_y < \varepsilon \text{ such that } y = \gamma(x, v, \varepsilon_y)\} \tag{3.3}$$

is open in $M$ and diffeomorphic to a neighborhood of the zero section of $NH$ (this is the tubular neighborhood theorem, and $T(H, \varepsilon)$ is called a tubular neighborhood of $H$ in $M$; the corresponding coordinates are called Fermi coordinates). In this setting there is a result corresponding to Weyl's formula [Gray, 1981].

16.3.1. Exit time for Brownian motion 

Given that the construction of a tube is completely geometric, it is possible to view Weyl's formula as a special case of a more general program in which one studies the asymptotic behavior of various geometric analogs of "volume" of a tube. This idea was carried out by Gray and Pinsky who studied the behavior of the mean exit time of Brownian motion (integrated over starting points in the given submanifold) from a tube of radius $\varepsilon$. There are by now a number of surveys of this material ([Pinsky, 1991], [Pinsky, 1995]). We sketch the main ideas and a few of the main results when the submanifold is a point.

Thus, let $(M,g)$ be a Riemannian manifold, $x \in M$, and $T(x, \varepsilon)$ the geodesic ball of radius $\varepsilon$ centered at $x$. Let $X_t$ be Brownian motion on $M$, $\tau_\varepsilon$ the exit time of Brownian motion from $T(x, \varepsilon)$:

$$\tau_\varepsilon = \inf\{t > 0 : X_t \notin T(x, \varepsilon)\}.$$

THEOREM 1 (cf [Gray, 1983]) As $\varepsilon \to 0^+$,

$$\mathbb{E}^x[\tau_\varepsilon] = c_0\varepsilon^2 + c_1 S \varepsilon^4 + \left[c_2|R_{jklm}|^2 + c_3|\mathrm{Ric}_{ij}|^2 + c_4 S^2 + c_5 \Delta S\right]\varepsilon^6 + O(\varepsilon^8) \tag{3.4}$$

where the constants $c_i$ depend only on dimension, $S$ is the scalar curvature at $x$, $|\mathrm{Ric}_{ij}|$ is the norm of the Ricci curvature at $x$, $|R_{jklm}|$ is the norm of the Riemann curvature at $x$ and $\Delta$ is the Laplace operator.

Using expansion (3.4) one has

THEOREM 2 (cf [Gray, 1983]) Suppose that $(M,g)$ is Riemannian of dimension $n < 6$. Suppose that for all $x \in M$,

$$\mathbb{E}^x[\tau_\varepsilon] = c_0\varepsilon^2 + O(\varepsilon^8),$$

where $c_0$ is the constant appearing in (3.4). Then $M$ is locally flat.

The condition $n < 6$ cannot be relaxed. This is a result of Hughes:

THEOREM 3 (cf [Hughes, 1992]) Let $S^3$ be the unit sphere in $\mathbb{R}^4$ and let $H^3$ be three dimensional hyperbolic space. Let $(M,g)$ be the product Riemannian manifold given by $S^3 \times H^3$ and let $x \in M$. For all sufficiently small $\varepsilon$, the probability law of $\tau_\varepsilon$ coincides with the probability law of the exit time of Brownian motion from a ball of radius $\varepsilon$ in $\mathbb{R}^6$.

The negative result of Theorem 3 indicates that to obtain more geometric information from Brownian motion in a small ball, one should consider something other than higher moments. A natural choice is the exit place of Brownian motion. Using the exponential map, there is a simple representation of the exit place distribution as a measure on $S^{n-1}$. More precisely, we have

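THEOREM 4 (cf [Liao, 1988], [Pinsky, 1995]) Let $(M,g)$ be Riemannian, $x \in M$, and $\exp_x$ the exponential map at $x$. Suppose that $f : S^{n-1} \to \mathbb{R}$ is a continuous map. Define $S_\varepsilon f(x)$ by

$$S_\varepsilon f(x) = \mathbb{E}^x\left[f\left(\varepsilon^{-1}\exp_x^{-1}(X_{\tau_\varepsilon})\right)\right].$$

Then, as $\varepsilon \to 0^+$,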



Sef(x) = / 

+ 


24 y s „_i L dx k n 


f{0)M 

e k ds' 

+ 2dx k _ 


f(9)d$ (3.5) 



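where $d\theta$ is Lebesgue measure, $\mathrm{Ric}_{ij}$ is the Ricci curvature, and $S$ is the scalar curvature.

This expansion gives the following result:

THEOREM 5 (cf [Liao, 1988]) Suppose that $S_\varepsilon$ and $f$ are as in Theorem 4. Suppose that for all $x \in M$,

$$S_\varepsilon f(x) = \int_{S^{n-1}} f(\theta)\, d\theta + O(\varepsilon^2).$$

Then $M$ is Einstein. If, in addition, $\mathbb{E}^x[\tau_\varepsilon] = c_0\varepsilon^2 + O(\varepsilon^8)$ with $c_0$ the constant appearing in (3.4), then $M$ is locally flat.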

16.3.2. Cover times 

Let $G$ be a finite graph, $X_n$ a random walk on $G$. Define the cover time of $G$ by $X_n$, denoted $T_G$, by

$$T_G = \min\{n : \{X_j\}_{j=0}^n = \mathrm{vertices}(G)\}. \tag{3.6}$$

Cover times appear in a variety of applications in computer science, physics and statistics (for a survey, see [Aldous, 1989]). For many such applications, understanding how cover time is related to the underlying structure of the walk is an important problem. An example of particular interest is the two dimensional torus $\mathbb{Z}_n^2 = \mathbb{Z}^2/n\mathbb{Z}^2$ with a simple random walk. In this case, there is a conjecture of Aldous (1989) for the asymptotic behavior of the cover time $T_n$ for large $n$:

$$\lim_{n \to \infty} \frac{T_n}{(n \log n)^2} = \frac{4}{\pi} \quad \text{almost surely.} \tag{3.7}$$
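
The definition of cover time is simple to simulate. The sketch below is an illustration only: the function name `cover_time_torus`, the starting vertex, the torus size and the seed are all assumptions, and a handful of samples on a small torus says nothing about the asymptotic constant in the conjecture.

```python
import random


def cover_time_torus(n, rng):
    """Steps needed by a simple random walk on the discrete torus (Z/nZ)^2,
    started at (0, 0), to visit every vertex at least once."""
    x, y = 0, 0
    visited = {(x, y)}
    steps = 0
    total_vertices = n * n
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    while len(visited) < total_vertices:
        dx, dy = rng.choice(moves)        # each neighbour with probability 1/4
        x = (x + dx) % n
        y = (y + dy) % n
        visited.add((x, y))
        steps += 1
    return steps


rng = random.Random(0)                    # fixed seed, for reproducibility only
samples = [cover_time_torus(8, rng) for _ in range(20)]
mean_cover_time = sum(samples) / len(samples)
```

Since each step reveals at most one new vertex, every sample is at least $n^2 - 1$; the interest of the conjecture lies in how much larger the typical value is.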

This conjecture has recently been settled by Dembo, Peres, Rosen and Zeitouni using a careful analysis of Brownian excursion on the two-torus $\mathbb{T}^2 = \mathbb{R}^2/\mathbb{Z}^2$ [Dembo, 2001]. More precisely, suppose that $X_t$ is Brownian motion on $\mathbb{T}^2$ and let $\varepsilon > 0$. Let $B(x, \varepsilon)$ be the ball of radius $\varepsilon$ centered at $x$. Let $T_{x,\varepsilon}$ be the time required for Brownian motion to come within $\varepsilon$ of $x$:

$$T_{x,\varepsilon} = \inf\{t > 0 : X_t \in B(x, \varepsilon)\}.$$

Let $C_\varepsilon$ be the time it takes for Brownian motion to come within distance $\varepsilon$ of every point of $\mathbb{T}^2$:

$$C_\varepsilon = \sup\{T_{x,\varepsilon} : x \in \mathbb{T}^2\}. \tag{3.8}$$

Thus, $C_\varepsilon$ is the time it takes the Wiener sausage (the $\varepsilon$-tube around Brownian motion) to cover $\mathbb{T}^2$. The main result of [Dembo, 2001] is the following

THEOREM 6 ([Dembo, 2001]) Let $X_t$ be Brownian motion on $\mathbb{T}^2$. Then

$$\lim_{\varepsilon \to 0} \frac{C_\varepsilon}{|\log \varepsilon|^2} = \frac{2}{\pi}. \tag{3.9}$$



The result generalizes to two-dimensional, compact, connected Riemannian 
manifolds. 

To establish the theorem, the authors control $\varepsilon$-hitting using excursions between concentric disks. Their techniques as well as their results are of interest and will be useful for attacking a wide variety of related problems; for example, the Erdos-Taylor conjecture.

Given a simple random walk on $\mathbb{Z}^2$ and a point $x \in \mathbb{Z}^2$, let $T_n(x)$ be the number of times that the walk visits $x$ up to time $n$. Let

$$T_n^* = \max_{x \in \mathbb{Z}^2} T_n(x)$$

be the number of times that the walk visits the most frequently visited position. It is a longstanding conjecture of Erdos and Taylor that

$$\lim_{n \to \infty} \frac{T_n^*}{(\log n)^2} = \frac{1}{\pi} \quad \text{almost surely.}$$

Using techniques closely related to those developed in [Dembo, 2001], the conjecture is established in [Dembo, 2001 A].

16.3.3. Principal curves

Suppose that $X$ is a random vector in $\mathbb{R}^2$ with distribution given by a smooth density $\rho$. Suppose that $\Gamma \subset \mathbb{R}^2$ is a smooth embedded compact curve and let $\mathrm{dist}(x, \Gamma)$ be the Euclidean distance from $x \in \mathbb{R}^2$ to the curve $\Gamma$. Define an exceptional set, $E$, by

$$E = \{x \in \mathbb{R}^2 : \exists\, y_1 \ne y_2 \in \Gamma,\ \mathrm{dist}(x, \Gamma) = \mathrm{dist}(x, y_1) = \mathrm{dist}(x, y_2)\}.$$

Then $E$ is a set of Lebesgue measure zero. Let $\pi_\Gamma : \mathbb{R}^2 \setminus E \to \Gamma$ be the map which associates to each $x \in \mathbb{R}^2 \setminus E$ the point on $\Gamma$ nearest to $x$. A curve $\Gamma$ is called principal for the random variable $X$ if $\Gamma$ is self-consistent:

$$\mathbb{E}[X \mid \pi_\Gamma(X) = x] = x$$

for almost every $x \in \Gamma$. Principal curves, first studied by Hastie-Stuetzle [Hastie, 1989], generalize the statistical notion of linear principal components and are designed to give meaning to the idea of a "curve passing through a data set."

Given a random vector $X$ as above one can formulate a natural variational problem for the "best fit principal curve $\Gamma$" by minimizing the expected distance squared between $\Gamma$ and the vector $X$, ie by minimizing $F(\Gamma)$ where $F$ is given by

$$F(\Gamma) = \mathbb{E}\left[\|X - \pi_\Gamma(X)\|^2\right] \tag{3.10}$$

where the norm is given by the Euclidean distance. Such a program was carried out by Duchamp-Stuetzle [Duchamp, 1996] who computed the corresponding Euler-Lagrange equation for the functional, finding the critical curves are constrained to have their curvatures given in terms of the first and second moments of the induced transverse densities along the normal fibres of the curve $\Gamma$. In addition, they found that none of these curves are minima.
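
The link with linear principal components can be seen by evaluating the expected squared distance on straight lines. The computation below is an illustrative sketch: it assumes $X$ has mean zero and covariance matrix $\Sigma$ (the symbols $\Sigma$, $u$ and $\Gamma_u$ are introduced here for the example), and it ignores the requirement that the curve be compact.

```latex
% Line through the origin with unit direction u: \Gamma_u = \{ s u : s \in \mathbb{R} \}.
\begin{align*}
\pi_{\Gamma_u}(X) &= \langle X, u \rangle\, u, \\
\| X - \pi_{\Gamma_u}(X) \|^2 &= \|X\|^2 - \langle X, u \rangle^2, \\
F(\Gamma_u) = \mathbb{E}\big[ \| X - \pi_{\Gamma_u}(X) \|^2 \big]
  &= \operatorname{tr} \Sigma - u^{T} \Sigma\, u .
\end{align*}
% Among lines through the mean, F is smallest when u is a unit eigenvector
% of \Sigma for its largest eigenvalue, ie the first principal component.
```

Restricted to lines, the variational problem therefore reduces to classical principal component analysis.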

It is possible to formulate the notion of principal submanifolds for a random vector in a Riemannian manifold. The corresponding variational problem for the expected distance to the principal submanifold leads to constraints on the curvature components appearing in (3.2) in terms of moments of the induced densities along normal fibers. At present it is unclear whether the notion of a principal submanifold can be used to effectively address problems involving "statistical shape." What is clear is that these calculations give rise to geometric and probabilistic objects which warrant further study.

16.4 Spectral Geometry 

Let $(M,g)$ be a closed Riemannian manifold (ie $M$ is compact without boundary), $\Delta : C^\infty(M) \to C^\infty(M)$ the Laplace-Beltrami operator acting on functions. Then $\Delta$ is essentially self-adjoint (ie it has a unique self-adjoint extension to $L^2(M, dg)$) and its spectrum is real and nonnegative. Let $R(\lambda) = (\Delta - \lambda)^{-1}$ be the resolvent of $\Delta$ at $\lambda$. By Rellich's theorem $R(\lambda)$ is a compact operator on $L^2(M, dg)$ and it follows from the machinery of functional analysis ([Reed, 1978]) that the spectrum of $\Delta$ consists of discrete eigenvalues of finite multiplicity with a unique accumulation point at infinity. We write the spectrum of $\Delta$ as

$$\mathrm{spec}(M) = \{\lambda \in \mathbb{R} : \exists\, f \in L^2(M, dg),\ f \ne 0,\ \Delta f + \lambda f = 0\}. \tag{4.1}$$

Since the Laplace-Beltrami operator on $k$-forms can be treated in the same fashion as the Laplace operator on functions, the spectrum of the Laplace-Beltrami operator on $k$-forms consists of discrete eigenvalues of finite multiplicity with a unique accumulation point at infinity.

When $D \subset M$ is a smoothly bounded domain with compact closure and we impose Dirichlet boundary conditions, it is again true that the spectrum of $D$, denoted $\mathrm{spec}(D)$, will behave as it does when $M$ is closed. When $M$ is not compact, the behavior of the spectrum of the Laplacian is considerably more involved. The majority of our comments are restricted to the case of smoothly bounded domains with compact closure.

In both the closed case and the case of a smoothly bounded domain, the fundamental problem of spectral geometry can be stated as follows:

What is the precise relationship between $\mathrm{spec}(M)$ (respectively, $\mathrm{spec}(D)$) and the geometry of $M$ (respectively, $D$)?

There are a number of good surveys of spectral geometry available (cf [Anderson, 1997], [Berard, 1986] and references therein; [Berard, 1986] contains an extensive bibliography for results prior to 1985). In addition, there are a number of texts which discuss the connections between geometry and spectral data (cf [Chavel, 1984], [Schoen, 1994]). We focus on those topics related to probability. Our results fall roughly into two classes: (1) results involving the use of exit time moments to study spectral geometric objects and (2) techniques involving the notion of coupling for studying spectral geometric objects.

16.4.1. Principal eigenvalue for planar domains and 
torsional rigidity 

Interest in the connection between the geometry of a Euclidean domain and the associated Dirichlet spectrum first arose during the 19th century in studies involving elastic bodies. In these studies the Dirichlet spectrum of a plane domain indexed the allowable modes of vibration of a homogeneous planar membrane with boundary held fixed. For such a model, the first Dirichlet eigenvalue, giving the lowest allowable energy of vibration, plays a special role as it is the dominant factor in studies involving small perturbations of the membrane. Counted among the first results of the field is the conjecture of Rayleigh (later proved by Faber and Krahn - Theorem 16):

THEOREM 7 Let $v$ be a positive real number. Then for all domains $D \subset \mathbb{R}^2$,

$$\mathrm{Vol}(D) = v \implies \lambda_1(D) \ge \lambda_1(B) \tag{4.2}$$

where $B$ is a disk of volume $v$.
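
The inequality can be checked directly on domains whose first Dirichlet eigenvalue is known in closed form. The sketch below is an illustration only: the helper names are hypothetical, the eigenvalues are those of $-\Delta$ (no factor $\frac{1}{2}$), and the first positive zero of the Bessel function $J_0$ is hard-coded as a numerical constant rather than computed.

```python
import math

# First positive zero of the Bessel function J_0 (numerical constant).
J0_FIRST_ZERO = 2.404825557695773


def lambda1_disk(area):
    """First Dirichlet eigenvalue of -Laplacian on a disk of the given area."""
    radius_squared = area / math.pi
    return J0_FIRST_ZERO ** 2 / radius_squared


def lambda1_rectangle(a, b):
    """First Dirichlet eigenvalue of -Laplacian on an a-by-b rectangle."""
    return math.pi ** 2 * (1.0 / a ** 2 + 1.0 / b ** 2)


disk = lambda1_disk(1.0)                  # disk of area 1
square = lambda1_rectangle(1.0, 1.0)      # unit square, area 1
thin = lambda1_rectangle(2.0, 0.5)        # 2 x 1/2 rectangle, area 1
```

Among these three domains of equal area the disk has the smallest eigenvalue, and the eigenvalue grows as the rectangle becomes thinner.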

The Rayleigh conjecture can be viewed from a variety of perspectives. For the present we note that (4.2) provides a lower bound for the Dirichlet spectrum in terms of geometric data associated to the domain. In this sense the Rayleigh conjecture is prototypical of a great many estimates for the principal eigenvalue (the idea being to bound $\lambda_1$ in terms of natural geometric parameters associated to the underlying domain). We review results for which the bounds are probabilistic.

Let $X_t$ be Brownian motion on $\mathbb{R}^2$, let $D \subset \mathbb{R}^2$ be smoothly bounded with compact closure and let $\tau = \tau_D$ be the first exit time from $D$. Motivated in part by Hayman's bound for $\lambda_1$ for planar domains in terms of the inradius of the domain [Hayman, 1978], Banuelos and Carroll prove

THEOREM 8 (cf [Banuelos, 1994]) Let $D \subset \mathbb{R}^2$ and suppose that $\tau$ is the exit time of Brownian motion. Then

* <A,< 7 « 3 )!, (4.3) 



sup xeD E*[T] - - 8sup xeD E*[T] 

where $\zeta(s)$ is the Riemann zeta-function and $j_0$ is the first positive zero of the Bessel function of the first type, $J_0(x)$. If $\sigma_D$ is the Schlicht-Landau-Bloch constant of $D$, then
-L<supE*[r]<^. (4.4) 

Moreover, the left hand side of (4.3) is sharp.

Inequality (4.3) of Theorem 8 states that $\lambda_1$ can be estimated by the $L^\infty$-norm of $\mathbb{E}^x[\tau]$ (and inequality (4.4) indicates that there are geometric estimates for the $L^\infty$-norm of $\mathbb{E}^x[\tau]$). There are similar statements for all $L^p$-norms of $\mathbb{E}^x[\tau]$, denoted $\|\mathbb{E}^x[\tau]\|_p$, as well as estimates involving the higher moments of $\tau$. It should be clear that these norms are all geometric invariants; they do not change under the action of the isometry group of the ambient space.

The $L^1$-norm of the first moment of the exit time plays an interesting role in the theory. Historically, interest in $\|\mathbb{E}^x[\tau]\|_1$ first arose in the 19th century, again in the theory of planar elastic bodies, where it is proportional to the torsional rigidity associated to a homogeneous cylinder with defining cross section $D$. The St. Venant Torsion Conjecture, first proved by Polya [Polya, 1948], gives a natural geometric bound:

THEOREM 9 Let $v$ be a positive real number. Then for all domains $D \subset \mathbb{R}^2$,

$$\mathrm{Vol}(D) = v \implies \|\mathbb{E}^x[\tau_D]\|_1 \le \|\mathbb{E}^x[\tau_B]\|_1 \tag{4.5}$$

where $B$ is a disk of volume $v$.

The Torsion Conjecture inspired a great deal of analysis and the corresponding literature is extensive (cf [Bandle, 1986], [Iesan, 1980], and references therein). The vast majority of the literature is written from the point of view of elastica. Thus, there are a variety of techniques and results for dealing with torsional rigidity which may be brought to bear on problems involving $L^1$-norms of exit time and vice-versa. We provide an example concerning the fundamental result of [Serrin, 1971] in Section 6 below.

16.4.2. Dirichlet spectrum for domains with compact 
closure in complete Riemannian manifolds and 
exit time moments 

Suppose that $M$ is a complete Riemannian manifold, $D \subset M$ a smoothly bounded domain with compact closure. Let $\tau$ be the exit time of Brownian motion from $D$. For $\lambda \in \mathrm{spec}(D)$, let $E_\lambda(1)$ denote the projection of the constant function 1 on the eigenspace of the Dirichlet Laplacian corresponding to $\lambda$. Set

$$a_\lambda^2 = \int_D |E_\lambda(1)|^2\, dg \tag{4.6}$$

where $dg$ is the volume form associated to the metric $g$. Let

$$\mathrm{spec}^*(D) = \{\lambda \in \mathrm{spec}(D) : a_\lambda^2 \ne 0\} \tag{4.7}$$

and define

$$\mathrm{vp}(D) = \{a_\lambda^2 : \lambda \in \mathrm{spec}^*(D)\}. \tag{4.8}$$

Then $\mathrm{vp}(D)$ describes how the volume of the domain $D$ is partitioned amongst eigenspaces and, in particular, $\sum_{\lambda \in \mathrm{spec}(D)} a_\lambda^2 = \mathrm{Vol}(D)$. Moreover, denoting the $L^1$-norm of the $k$th moment of the exit time by

$$\|\mathbb{E}^x[\tau^k]\|_1 = \int_D \mathbb{E}^x[\tau^k]\, dg, \tag{4.9}$$

we have (cf [McDonald, (to appear)])

$$\|\mathbb{E}^x[\tau^k]\|_1 = \Gamma(k+1) \sum_{\lambda \in \mathrm{spec}^*(D)} a_\lambda^2 \left(\frac{2}{\lambda}\right)^k \tag{4.10}$$

where $\Gamma$ is the gamma-function (in fact, (4.10) holds for all real $k > 0$). A straightforward computation gives the estimate

$$\|\mathbb{E}^x[\tau^k]\|_1 \le \Gamma(k+1) \left(\frac{2}{\lambda_1}\right)^k \mathrm{Vol}(D). \tag{4.11}$$

Estimate (4.11) holds for arbitrary compact manifolds with nonempty boundary and suggests that, in this context, the $L^1$-norms of the exit time moments behave like the reciprocal of the principal eigenvalue. This observation is consistent with the relationship between Theorem 7 and Theorem 9, as well as with the results of Theorem 8. The same relationship appears for a great number of comparison geometry results and will be developed below (cf Theorem 22). That the relationship holds also provides a means of studying the behavior of the first Dirichlet eigenvalue using techniques developed for studying first exit time moments. We provide an example:

Let $p_D(t,x,y)$ be the heat kernel associated to $D$, $\{\phi_\lambda\}$ a complete set of orthonormal eigenfunctions for the Dirichlet Laplacian. Write

$$p_D(t,x,y) = \sum_\lambda \phi_\lambda(x)\phi_\lambda(y)e^{-\lambda t}. \tag{4.12}$$

Let $dg$ be the volume form and, as in (2.15), let $u_D(t,x)$ be defined by

$$u_D(t,x) = \int_D p_D(t,x,y)\, dg(y). \tag{4.13}$$

Then $u_D(t,x)$ is the distribution of the exit time

$$u_D(t,x) = \mathbb{P}^x(\tau > t) \tag{4.14}$$

and using (4.12), (4.13) and (4.14) we see that the first Dirichlet eigenvalue characterizes large deviations of $\tau$:

$$\mathbb{P}^x(\tau > t) \sim Ce^{-\lambda_1 t}. \tag{4.15}$$
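
The exponential decay of the exit time distribution can be observed on an interval, where the eigenfunction expansion is explicit. The sketch below is an illustration under stated assumptions: the domain is $D = (-1,1)$, Brownian motion has generator $\frac{1}{2}\frac{d^2}{dx^2}$, and the decay rate is therefore normalized as the lowest eigenvalue of $-\frac{1}{2}\frac{d^2}{dx^2}$, namely $\pi^2/8$; the function name and the truncation level are hypothetical.

```python
import math


def survival_probability(t, x, terms=400):
    """Truncated eigenfunction series for P^x(tau > t) on D = (-1, 1).

    Normalized eigenfunctions: phi_j(x) = sin(j pi (x + 1) / 2), j = 1, 2, ...
    Their integrals over D vanish for even j and equal 4 / (j pi) for odd j.
    """
    total = 0.0
    for j in range(1, 2 * terms, 2):              # odd j only
        mu = 0.5 * (j * math.pi / 2) ** 2         # eigenvalue of -(1/2) d^2/dx^2
        phi = math.sin(j * math.pi * (x + 1) / 2)
        total += (4.0 / (j * math.pi)) * phi * math.exp(-mu * t)
    return total


# Empirical decay rate of t -> P^0(tau > t) between t = 4 and t = 5.
decay_rate = -math.log(survival_probability(5.0, 0.0) / survival_probability(4.0, 0.0))
early_value = survival_probability(0.01, 0.0)     # essentially no exit by t = 0.01
```

For large times only the first term of the series matters, which is exactly the mechanism behind the large deviation statement.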

When $D$ is a small geodesic ball of radius $\varepsilon$, this observation and the corresponding analysis of the small $\varepsilon$ asymptotics of the first exit time led Karp and Pinsky to the small $\varepsilon$ asymptotics for the first Dirichlet eigenvalue. More precisely,

THEOREM 10 (cf [Karp, 1987]) Suppose that $(M,g)$ is a Riemannian manifold, that $x \in M$ and that $B_M(x, \varepsilon)$ is a geodesic ball of radius $\varepsilon$ centered at $x$. Let $\lambda_1(x, \varepsilon)$ be the corresponding first Dirichlet eigenvalue. Then, as $\varepsilon \to 0^+$, there is an expansion of the form

$$\lambda_1(x, \varepsilon) = c_0\varepsilon^{-2} + c_1 S + c_2\left[|R_{jklm}|^2 - |\mathrm{Ric}_{ij}|^2 + 6\Delta S\right]\varepsilon^2 + O(\varepsilon^4) \tag{4.16}$$

where the constants $c_i$ depend only on dimension, $S$ is the scalar curvature at $x$, $|\mathrm{Ric}_{ij}|$ is the norm of the Ricci curvature at $x$, $|R_{jklm}|$ is the norm of the Riemann curvature at $x$ and $\Delta$ is the Laplace operator.

If $M$ is compact, one can consider the asymptotics of the Dirichlet spectrum for the complement of a small ball, $M \setminus B_M(x, \varepsilon)$. This problem was studied probabilistically by Kac [Kac, 1974], who considered the first time Brownian motion hits the small ball and obtained partial results on the asymptotics of the $k$th eigenvalue. These results were refined by Chavel and Feldman [Chavel, 1988]. The problem continues to define an active area of research.

Returning to the study of moments, we will write

$$\mathrm{mspec}(D) = \{\|\mathbb{E}^x[\tau^k]\|_1\}_{k=0}^\infty.$$

Then (4.10) says that the set $\mathrm{spec}^*(D) \cup \mathrm{vp}(D)$ determines the set $\mathrm{mspec}(D)$. It turns out that the converse is also true:

THEOREM 11 (cf [McDonald, (to appear)]) Suppose $D$, $D'$ are smoothly bounded domains with compact closure in $M$. Then

$$\mathrm{mspec}(D) = \mathrm{mspec}(D') \implies \mathrm{spec}^*(D) = \mathrm{spec}^*(D') \text{ and } \mathrm{vp}(D) = \mathrm{vp}(D'). \tag{4.17}$$

The proof of this result uses the solution of the classical Stieltjes moment problem [Akhiezer, 1965] and suggests that the techniques developed in the context of the moment problem might be useful in the context of spectral geometry.

As a corollary of Theorem 11, we obtain that the first Dirichlet eigenvalue is determined by $\mathrm{mspec}(D)$.

COROLLARY 12 ([McDonald, (to appear)]) Let $D \subset M$ be a smoothly bounded domain with compact closure. Let $\mathrm{spec}^*(D) = \{\nu_i\}_{i=1}^\infty$ enumerate elements of $\mathrm{spec}^*(D)$ in increasing order. Then

$$\nu_1 = \sup\left\{\nu \ge 0 : \limsup_{n \to \infty} \left(\frac{\nu}{2}\right)^n \frac{\|\mathbb{E}^x[\tau^n]\|_1}{\Gamma(n+1)} < \infty\right\} \tag{4.18}$$

and

$$a_{\nu_1}^2 = \limsup_{n \to \infty} \left(\frac{\nu_1}{2}\right)^n \frac{\|\mathbb{E}^x[\tau^n]\|_1}{\Gamma(n+1)}. \tag{4.19}$$

In fact, from Corollary 12 and (4.10) it is clear that the tail of the moment spectrum gives a recursion for the elements of $\mathrm{spec}^*(D)$ and $\mathrm{vp}(D)$ (cf [McDonald, (to appear)]).
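
The passage from the moment spectrum back to the first eigenvalue can be illustrated on an interval, where everything is explicit. The sketch below is an illustration under stated assumptions: $D = (-1,1)$, Brownian motion has generator $\frac{1}{2}\frac{d^2}{dx^2}$, the eigenvalues of $-\frac{d^2}{dx^2}$ that contribute are $\lambda_j = (j\pi/2)^2$ with $j$ odd, the corresponding weights are $a_{\lambda_j}^2 = 16/(j\pi)^2$, and the $L^1$-norm of the $k$th exit time moment is taken to be $\Gamma(k+1)\sum_j a_{\lambda_j}^2 (2/\lambda_j)^k$. The function names and the truncation level are hypothetical.

```python
import math


def moment_norm(k, terms=2000):
    """L^1-norm of the k-th exit time moment on (-1, 1), via a truncated
    spectral sum Gamma(k + 1) * sum_j a_j^2 (2 / lambda_j)^k over odd j."""
    total = 0.0
    for j in range(1, 2 * terms, 2):
        lam = (j * math.pi / 2) ** 2          # Dirichlet eigenvalue of -d^2/dx^2
        a_sq = 16.0 / (j * math.pi) ** 2      # squared integral of the eigenfunction
        total += a_sq * (2.0 / lam) ** k
    return math.gamma(k + 1) * total


def estimate_first_eigenvalue(n):
    """Estimate lambda_1 from two consecutive moments: the ratio
    m_n / (n m_{n-1}) tends to 2 / lambda_1 as n grows."""
    ratio = moment_norm(n) / (n * moment_norm(n - 1))
    return 2.0 / ratio


first_moment = moment_norm(1)       # exact value: integral of 1 - x^2, ie 4/3
second_moment = moment_norm(2)      # exact value: 32/15
lambda1_estimate = estimate_first_eigenvalue(12)
```

The ratio of consecutive moments is used here instead of the limsup itself because it converges geometrically fast in this example; it is a convenience of the sketch, not the method of the cited paper.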

Given (4.19), it is clear that the exit time moments are closely tied to the in- 
tegrals of normalized eigenfunctions. Such objects have received attention for 
a variety of reasons, including their relationship to asymptotics for the spectral
counting function, the asymptotics of the spectral heat function, and the heat
content asymptotics of $D$. Focussing our attention on heat content, we recall
the necessary facts:

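Let $u_D(t,x)$ be as defined in (4.13). Then $u_D(t,x)$ is the solution of the
initial value problem

$$
\begin{aligned}
\frac{1}{2}\Delta u_D &= \frac{\partial u_D}{\partial t} && \text{on } (0,\infty) \times D, \\
u_D(0,x) &= \begin{cases} 1 & \text{if } x \in D, \\ 0 & \text{if } x \in \partial D, \end{cases} \\
u_D(t,x) &= 0 && \text{if } x \in \partial D.
\end{aligned} \qquad (4.20)
$$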

Let $q(t)$ be the heat content of $D$ at time $t$:

$$q(t) = \int_D u_D(t,x)\,dg. \qquad (4.21)$$

We note that $q(t)$ is the Laplace-Stieltjes transform of the spectral heat function,
$h : \mathbb{R}_+ \to \mathbb{R}_+$, defined by

$$h(\sigma) = \sum_{\lambda \in \mathrm{spec}(D),\ \lambda \le \sigma} a_\lambda^2$$

where $a_\lambda$ is as in (4.6). Using a Tauberian theorem, van den Berg and Watson
have determined the first two terms in an asymptotic expansion of $h(\sigma)$ and
used this to obtain an estimate on the rate at which the $a_\lambda$ converge to zero
[van den Berg, 1999A].

It is a theorem of van den Berg and Gilkey [van den Berg, 1994] that q(t) 
admits a small time asymptotic expansion: 



$$q(t) \sim \sum_{n=0}^{\infty} q_n t^{n/2} \qquad (4.22)$$

where the coefficients $q_n$ are locally computable geometric invariants of $D$
(that is, every $q_n$ is given as an integral over the boundary of the domain or
an integral over the interior of the domain of a finite number of derivatives
of components of the Riemannian metric). We will refer to the coefficients
occurring on the right-hand side of (4.22) as the heat content asymptotics of $D$ and
we write

$$\mathrm{hca}(D) = \{q_n\}_{n=0}^{\infty}. \qquad (4.23)$$

We note that the invariants hca(D) are not spectral.

The probabilistic study of heat content is by now well developed in a variety
of contexts (piecewise smooth domains, fractal domains, etc) and the identification
of a number of the coefficients in the expansion has been carried out
(cf [van den Berg, 1994A], [van den Berg, 1994]; cf [Gilkey, 1999] for a re- 
cent survey of results concerning heat content). For example, it is known that 
the first coefficient is given by the volume of the domain (this is clear from 
(4.20) and (4.21)), while the second coefficient is given by a constant multiple 
of the area of the boundary of the domain, suggesting that heat content might 
be useful in the study of isoperimetric phenomena (cf section 5.1 below and 
[Burchard, 2002]). For polygonal domains, it is known that the asymptotics 
terminate after 3 terms (cf [Burchard, 2002]); it would be interesting to know 
whether similar phenomena exist in higher dimensions. 
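
For a concrete check of the first two coefficients, consider the interval $D = (0, L) \subset \mathbb{R}$ with generator $\frac{1}{2}\Delta$, where everything is explicit. The sketch below assumes the eigenfunction expansion $q(t) = \sum_\lambda a_\lambda^2 e^{-\lambda t/2}$ with $\lambda_k = (k\pi/L)^2$ and $a_{\lambda_k}^2 = 8L/(k\pi)^2$ for odd $k$ (zero for even $k$), and compares it with the two-term expansion $L - 2\sqrt{2t/\pi}$: the volume, minus a constant multiple of the number of boundary points times $t^{1/2}$. The function names are hypothetical and the interval is only an illustrative special case.

```python
import math

def heat_content_interval(t, length=1.0, n_terms=2000):
    """Heat content q(t) of (0, length) from the Dirichlet eigenfunction expansion."""
    total = 0.0
    for k in range(1, 2 * n_terms, 2):          # only odd k contribute
        lam = (k * math.pi / length) ** 2       # eigenvalue of -Laplacian
        weight = 8.0 * length / (k * math.pi) ** 2   # squared integral of the
                                                     # normalized eigenfunction
        total += weight * math.exp(-0.5 * lam * t)
    return total

def two_term_expansion(t, length=1.0):
    """Volume minus the boundary term: length - 2 * sqrt(2 t / pi)."""
    return length - 2.0 * math.sqrt(2.0 * t / math.pi)

if __name__ == "__main__":
    for t in (0.001, 0.01, 0.05):
        print(t, heat_content_interval(t), two_term_expansion(t))
```

For small $t$ the two values agree up to terms that are exponentially small in $1/t$, which is the one-dimensional counterpart of an expansion that terminates after finitely many terms.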

From Corollary 12 it is clear that heat content is closely related to mspec(D).
We have: 

THEOREM 13 ([McDonald, (to appear)]) Let $M$ be a complete Riemannian
manifold, $D \subset M$ a smoothly bounded domain with compact closure. Then
mspec(D) determines $q(t)$ (and thus hca(D)).

Using Theorem 4.4 and Theorem 4.5, we see that $\mathrm{spec}^*(D) \cup \mathrm{vp}(D)$ determines
hca(D), a geometric result proved via the analysis of a probabilistic
object (mspec(D)). This result suggests that the invariants mspec(D) may be
useful tools in studying questions involving the fine structure of isospectral domains.
To formulate a more precise statement, we again recall the basic facts:

In his often cited 1966 paper, Mark Kac popularized a fundamental problem
of planar spectral geometry: Does spec(D) determine $D$ up to isometry?
The problem was settled (at least in the piecewise smooth category) by Gordon,
Webb, and Wolpert [Gordon, 1992], who constructed a pair of nonisometric,
isospectral planar polygons. In 1994 Buser, Conway, Doyle and Semmler
[Buser, 1994] gave an elegant and straightforward construction of families of 
isospectral nonisometric planar polygonal pairs (we will abbreviate reference 
to such pairs by INIPP). Their constructions include a simplified version of 
the example of [Gordon, 1992] as a special case, as well as the first example 
of a pair of isospectral planar domains all of whose normalized eigenfunctions 
agree at a pair of distinguished interior points (so called homophonic domains). 
These examples are generated by a "seed" triangle together with a collection 
of congruent "reflection progeny" triangles produced by a sequence of reflec- 
tions across edges. In particular, the construction is essentially combinatorial 
and by focussing on the vertices and edges of the corresponding triangles, the 
construction can be taken to occur in the category of planar graphs. 

One might summarize the work of [Buser, 1994] by saying that, for piece- 
wise smooth planar domains, the Dirichlet spectrum provides an incomplete 
collection of geometric invariants. Such a summary suggests that to construct 
a good collection of geometric invariants, one might be well served by finding
invariants which distinguish INIPPs. In [McDonald, (to appear)A] we show 
that in the category of weighted graphs and their associated combinatorial 
Laplacians, there exist natural weighted graph analogs of INIPPs which are 
isospectral but not isomorphic, and that these graph pairs are distinguished by 
their heat content asymptotics (and thus by their moment spectra). The natural 
conjecture is that heat content distinguishes the isospectral domains of [Buser,
1994].

16.4.3. Spectral gap and coupling 

In the previous two subsections we have considered results which involve 
exit time moments of Brownian motion and the Dirichlet spectrum. In this sec- 
tion we consider estimates of the spectral gap obtained via coupling methods. 
We begin by recalling the requisite material involving spectral gaps. 

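For clarity of exposition, suppose that $a_{ij}, b_i : \mathbb{R}^n \to \mathbb{R}$ are smooth with
$(a_{ij})$ positive definite as an $n \times n$ matrix. Suppose there is a smooth function
$V$ satisfying

$$\int \exp(V(x))\,dx < \infty.$$

Let

$$L = \sum_{i,j} a_{ij} \frac{\partial^2}{\partial x_i \partial x_j} + \sum_i b_i \frac{\partial}{\partial x_i}. \qquad (4.24)$$

Let $\mu$ be the measure defined by

$$d\mu = \frac{\exp(V(x))\,dx}{\int \exp(V(x))\,dx}$$

and note that $L$ is symmetric with respect to the measure $\mu$. Let $\|f\|$ be the
norm of $f$ in $L^2(\mathbb{R}^n, d\mu)$.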

Let $P_t = e^{tL}$ be the heat operator for $L$. Fixing $f$ in the domain of $L$, for
$\epsilon$ small we have

$$\left\|P_t f - \int f\,d\mu\right\| \le \left\|f - \int f\,d\mu\right\| e^{-\epsilon t} \qquad (4.25)$$

for all $t > 0$. We are interested in studying the maximal $\epsilon$ for which (4.25)
holds (ie the rate at which $P_t f$ converges to $\int f\,d\mu$ in $L^2$). To this effect, we
define the spectral gap associated to $L$ by the variational principle

$$\mathrm{sg}(L) = \inf\left\{-\langle f, Lf\rangle : f \in L^2(\mathbb{R}^n, d\mu),\ \int f\,d\mu = 0,\ \|f\| = 1\right\}. \qquad (4.26)$$

Under mild assumptions on $L$ (eg the Dirichlet form is regular), it follows that
for all $\epsilon < \mathrm{sg}(L)$, (4.25) holds.

It should be clear that the development sketched above can be carried out in 
the context of ambient spaces other than $\mathbb{R}^n$. It is also the case that analogous
statements hold for processes which are not diffusions (eg general reversible 
Markov processes [Chen, 1994A]). 

If we restrict our attention to the Dirichlet Laplacian on a compact
Riemannian manifold, it is clear that the spectral gap coincides with the first
nonzero Dirichlet eigenvalue, the variational principle being equivalent to the
Rayleigh quotient. Thus, general results for estimates of the spectral gap give
rise to estimates for principal eigenvalues. It is in this context that we develop
the notion of coupling.

Coupling was originally introduced by Doeblin [Doeblin, 1938] to study the
rate of convergence to stationarity of a Markov chain. Lindvall is responsible
for adapting coupling techniques to Brownian motion (cf [Lindvall, 1983],
[Lindvall, 1986]). There are a number of surveys of coupling techniques available
(eg [Brin, 2001]) as well as a text ([Lindvall, 1992]). We recall the basic facts:

Again, for clarity of exposition let $X_t$ be a diffusion process on $\mathbb{R}^n$ with
generator the operator $L$ given in (4.24). By a coupling for the process $X_t$
we mean two copies of the process, denoted by $(X_t^1, X_t^2)$, which are taken
to begin at different points. More precisely, the processes $X_t^1$ and $X_t^2$ have
the same distribution as $X_t$ and the processes $X_t^1$, $X_t^2$, and $(X_t^1, X_t^2)$ are all
Markov with respect to the filtration generated jointly by $X_t^1$ and $X_t^2$. Define
the coupling time, $T$, by

$$T = \inf\{t > 0 : X_t^1 = X_t^2\}. \qquad (4.27)$$

Suppose that

1 it is possible to construct $X^1$ and $X^2$ such that for all $t > T$, $X_t^1 = X_t^2$.

2 there is a constant $\nu$ such that for generic starting points $x_1$, $x_2$,
$P(T > t \mid X_0^1 = x_1, X_0^2 = x_2) \sim e^{-\nu t}$.

Then one can prove that $\nu$ is a lower bound for sg(L).
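
The step from the tail of the coupling time to the spectral gap is not spelled out here. A minimal sketch of the usual argument follows, under the additional assumption (not made in the text) that $L$ has a bounded, nonconstant eigenfunction $f$ with $Lf = -\lambda f$, so that the heat operator acts by $P_t f = e^{-\lambda t} f$, and with the asymptotic in condition 2 read as an upper bound with some constant $C$:

```latex
\begin{align*}
e^{-\lambda t}\,\lvert f(x_1) - f(x_2)\rvert
  &= \lvert P_t f(x_1) - P_t f(x_2)\rvert \\
  &= \bigl\lvert E\bigl[f(X_t^1) - f(X_t^2)\bigr]\bigr\rvert \\
  &= \bigl\lvert E\bigl[\bigl(f(X_t^1) - f(X_t^2)\bigr)\,\mathbf{1}_{\{T \ge t\}}\bigr]\bigr\rvert
     && \text{(condition 1: the copies agree after } T\text{)} \\
  &\le 2\,\lVert f\rVert_\infty\, P\bigl(T \ge t \mid X_0^1 = x_1,\ X_0^2 = x_2\bigr) \\
  &\le 2\,C\,\lVert f\rVert_\infty\, e^{-\nu t}
     && \text{(condition 2)}.
\end{align*}
```

Choosing $x_1$, $x_2$ with $f(x_1) \ne f(x_2)$ and letting $t \to \infty$ forces $\lambda \ge \nu$. Applied to an eigenfunction for the smallest nonzero eigenvalue, this gives $\nu \le \mathrm{sg}(L)$ in this simplified setting.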

Thus, to apply coupling to estimate the spectral gap we must check the above
and arrange for $\nu$ to be close to sg(L) (ie the coupling should be efficient in
the language of [Brin, 2001]). This program has been carried out in a number 
of interesting geometric contexts in which it produces general lower bounds 
on the spectral gap (cf [Chen, 1997], [Chen, 1994]). We restrict our attention 
to examples of special interest; those involving the Dirichlet spectral gap and 
coupling. 

Suppose that $M$ is a complete Riemannian manifold, $D \subset M$ a smoothly
bounded domain with compact closure. Let $\{\phi_\lambda\}_{\lambda \in \mathrm{spec}(D)}$ be a complete
orthonormal family of eigenfunctions for the Dirichlet Laplacian and write the
heat kernel as

$$p_D(t,x,y) = \sum_{\lambda \in \mathrm{spec}(D)} e^{-\lambda t} \phi_\lambda(x) \phi_\lambda(y).$$

We consider the Dirichlet spectral gap $\lambda_2 - \lambda_1$.

There is a long history of estimates for the Dirichlet spectral gap in terms
of the underlying geometry of the domain. When $M = \mathbb{R}^n$ and $D$ is a convex
regular domain with diameter $d$, Singer, Wong, Yau and Yau [Singer, 1985]
established

$$\lambda_2 - \lambda_1 \ge \frac{\pi^2}{4d^2}. \qquad (4.28)$$

On the other hand, when $D$ is a rectangle it is easy to check that

$$\lambda_2 - \lambda_1 \ge \frac{3\pi^2}{d^2} \qquad (4.29)$$

and thus one expects improvements of the [Singer, 1985] estimate (4.28). For
Euclidean domains as above, such an improvement was given by Yu and Zhong
[Yu, 1986] who established the estimate

$$\lambda_2 - \lambda_1 \ge \frac{\pi^2}{d^2}. \qquad (4.30)$$

Realizing that the Dirichlet spectral gap can be considered as the first eigen- 
value of Brownian motion conditioned to remain forever in the domain, R. 
Smits [Smits, 1996] gave a second (probabilistic) proof of the estimate (4.30). 
Combining the ideas of Smits, comparison and the powerful general estimates 
of [Chen, 1997], Wang has considered the analog of the problem for general 
ambient manifolds. His recent results [Wang, 2000] recover and improve the 
known results involving Dirichlet spectral gaps and suggest that the technique 
will continue to produce improvements and new directions for further research. 
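
The rectangle computation behind these comparisons is easy to reproduce. The sketch below uses the explicit Dirichlet eigenvalues $\pi^2(j^2/a^2 + k^2/b^2)$ of an $a \times b$ rectangle and compares the gap with three diameter bounds, taken here to be $\pi^2/(4d^2)$, $\pi^2/d^2$ and $3\pi^2/d^2$ with $d^2 = a^2 + b^2$. These constants are stated as assumptions of the sketch, and the function names are hypothetical.

```python
import math

def rectangle_dirichlet_gap(a, b):
    """lambda_2 - lambda_1 for the Dirichlet Laplacian on an a-by-b rectangle."""
    eigenvalues = sorted(
        math.pi ** 2 * ((j / a) ** 2 + (k / b) ** 2)
        for j in range(1, 4)
        for k in range(1, 4)
    )
    return eigenvalues[1] - eigenvalues[0]

def diameter_bounds(a, b):
    """Three lower bounds for the gap in terms of the diameter d of the rectangle."""
    d_squared = a * a + b * b
    return {
        "pi^2 / (4 d^2)": math.pi ** 2 / (4.0 * d_squared),
        "pi^2 / d^2": math.pi ** 2 / d_squared,
        "3 pi^2 / d^2": 3.0 * math.pi ** 2 / d_squared,
    }

if __name__ == "__main__":
    for a, b in ((1.0, 1.0), (2.0, 1.0), (5.0, 1.0)):
        print((a, b), rectangle_dirichlet_gap(a, b), diameter_bounds(a, b))
```

For a rectangle with longer side $a$ the gap equals $3\pi^2/a^2$, and since $a \le d$ all three bounds hold, with the last one approached as the rectangle becomes thin.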

16.5 Isoperimetric Conditions and Comparison Geometry 

The Rayleigh Conjecture (Theorem 7 above) was established in the early 
twentieth century by Faber and by Krahn who both realized that the feature of 
fundamental importance in establishing a proof is the isoperimetric property 
of planar domains (among planar domains of fixed area, a disk has minimum 
perimeter). The first rigorous proof of the isoperimetric property for Euclidean 
domains was given by Steiner in the nineteenth century using rearrangement 
techniques pioneered for just this purpose. That the isoperimetric property 
holds for domains when the ambient space is a Euclidean sphere or hyperbolic
space was established by Schmidt [Schmidt, 1943]. Using Euclidean space, 
Euclidean spheres, and hyperbolic space as models we can study analogs of 
the isoperimetric property and other geometric phenomena in more general 
ambient spaces. 

16.5.1. Isoperimetric phenomena and moments of exit times

We begin by formalizing our notion of a model: 

DEFINITION 14 Let $\kappa$ be a real number. The constant curvature space form
with curvature $\kappa$, denoted $M_\kappa$, is

1 A sphere in Euclidean space if $\kappa > 0$,

2 Euclidean space if $\kappa = 0$,

3 A hyperbolic space if $\kappa < 0$.

DEFINITION 15 Suppose that $M$ is a Riemannian manifold. We say that $M$
satisfies an isoperimetric condition with constant curvature comparison space
$M_\kappa$ if for all Borel $D \subset M$,

$$\mathrm{Vol}(D) = v \Rightarrow \mathrm{Area}_M(\partial D) \ge \mathrm{Area}_{M_\kappa}(\partial B)$$

where $B \subset M_\kappa$ is a geodesic ball of volume $v$ and "Area" denotes the Minkowski
measure induced by the corresponding Riemannian metrics.

We note that there is a great deal of literature devoted to determining precise 
regularity requirements for isoperimetric phenomena. For the purpose of this 
section, all domains are taken to be smoothly bounded unless otherwise indi- 
cated. In this case, all reasonable definitions of area will coincide. 
We can now state the result of Faber-Krahn: 

THEOREM 16 Suppose that $M$ is a Riemannian manifold which satisfies an
isoperimetric condition with constant curvature comparison space $M_\kappa$. Then,
for all $D \subset M$,

$$\mathrm{Vol}(D) = v \Rightarrow \lambda_1(D) \ge \lambda_1(B)$$

where $B \subset M_\kappa$ is a geodesic ball of volume $v$ and $\lambda_1$ is the first Dirichlet
eigenvalue.

The proof of Theorem 16 uses symmetric rearrangement. As this will play a 
role in much of this section, we recall the basic facts. 

Given $D \subset M$, a Borel set of finite volume, we denote by $D^*$ the ball in
$M_\kappa$ (centered at an appropriate origin) of volume equal to that of $D$. Suppose
that $f : D \to [0, \infty)$ and suppose that the positive level sets of $f$ all have finite
volume. Suppose $\mu > 0$, and let

$$D(\mu) = \{y \in D : f(y) > \mu\}.$$

We define the spherically symmetric decreasing rearrangement of $f$, denoted
$f^* : D^* \to [0, \infty)$, as the radial function

$$f^*(|x|) = \sup\{\mu : x \in D(\mu)^*\}.$$

It follows from the definition that $f$ and $f^*$ are equimeasurable. It follows
from the co-area formula (see [Chavel, 1984], [Chavel, 2001]) that symmetric
rearrangement is nonincreasing for the $H^1$-Sobolev norm. Applying this to the
Rayleigh quotients which compute the first Dirichlet eigenvalue, we have

$$\frac{\int_D |\nabla f|^2\,dg}{\int_D f^2\,dg} \ge \frac{\int_{D^*} |\nabla f^*|^2\,dg_\kappa}{\int_{D^*} (f^*)^2\,dg_\kappa}$$

from whence Theorem 16 follows.
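
The two properties used here (equimeasurability, and no increase of the Dirichlet energy) can be seen on a one-dimensional, discrete caricature of the construction. The sketch below is only an illustration on grid functions with zero boundary values, not the rearrangement on $M_\kappa$ itself. The function names are hypothetical and the discrete energy is an assumed stand-in for $\int |\nabla f|^2$.

```python
def symmetric_decreasing_rearrangement(values):
    """Place the largest value at the centre of the grid and the remaining
    values alternately to the right and to the left, in decreasing order."""
    ordered = sorted(values, reverse=True)
    n = len(ordered)
    centre = (n - 1) // 2
    out = [0.0] * n
    for rank, value in enumerate(ordered):
        offset = (rank + 1) // 2 if rank % 2 == 1 else -(rank // 2)
        out[centre + offset] = value
    return out

def dirichlet_energy(values):
    """Sum of squared increments, with a zero boundary value added at each end."""
    padded = [0.0] + list(values) + [0.0]
    return sum((b - a) ** 2 for a, b in zip(padded, padded[1:]))

def rayleigh_quotient(values):
    """Discrete Dirichlet energy divided by the squared l^2 norm."""
    return dirichlet_energy(values) / sum(v * v for v in values)

if __name__ == "__main__":
    f = [3.0, 0.0, 2.0, 1.0]
    f_star = symmetric_decreasing_rearrangement(f)
    print(f_star, rayleigh_quotient(f), rayleigh_quotient(f_star))
```

In this example the rearranged function takes the same values with the same multiplicities, so the denominator of the Rayleigh quotient is unchanged while the numerator drops.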

The results of the previous section (Theorem 8, (4.11)) suggest that the $L^p$-norms
of the exit time moments behave like the reciprocal of the principal
Dirichlet eigenvalue. This suggests the following analog of the Faber-Krahn
result:

THEOREM 17 Suppose that $M$ is a Riemannian manifold which satisfies an
isoperimetric condition with constant curvature comparison space $M_\kappa$. Let $\tau$
be the first exit time of Brownian motion. Then, for all $D \subset M$, for all $k \in \mathbb{N}$,

$$\mathrm{Vol}(D) = v \Rightarrow \|E^x(\tau_D^k)\|_p \le \|E^x(\tau_B^k)\|_p$$

where $B \subset M_\kappa$ is a geodesic ball of volume $v$.

This theorem is essentially due to Aizenman and Simon in the Euclidean case 
[Aizenman, 1982] (see also [Kinateder, 1998]). The general result can be 
found in [McDonald, 2002]. 
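
The simplest instance of this comparison ($k = 1$, $p = 1$, $M = \mathbb{R}^2$) can be checked by hand. The sketch below assumes that $u(x) = E^x[\tau_D]$ solves $\frac{1}{2}\Delta u = -1$ with zero boundary values, which gives $u = (R^2 - |x|^2)/2$ on a disk of radius $R$ and a double sine series on a square. It then compares the $L^1$-norms for a disk and a square of equal area. The function names and the truncation level are hypothetical.

```python
import math

def mean_exit_time_l1_disk(area):
    """||E^x[tau]||_1 for a planar disk of the given area.

    With u = (R^2 - r^2) / 2, the integral over the disk is pi * R^4 / 4.
    """
    radius_squared = area / math.pi
    return math.pi * radius_squared ** 2 / 4.0

def mean_exit_time_l1_square(area, n_terms=200):
    """||E^x[tau]||_1 for a square of the given area, from the sine series
    of the solution of (1/2) Laplacian u = -1 with zero boundary values."""
    total = 0.0
    for j in range(1, 2 * n_terms, 2):          # only odd modes contribute
        for k in range(1, 2 * n_terms, 2):
            total += 128.0 / (math.pi ** 6 * j * j * k * k * (j * j + k * k))
    return total * area ** 2                    # the integral scales like side^4

if __name__ == "__main__":
    print(mean_exit_time_l1_square(1.0), mean_exit_time_l1_disk(1.0))
```

The square gives roughly 0.0703 against $1/(4\pi) \approx 0.0796$ for the disk of the same area, in line with the inequality.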

In fact, the argument of Aizenman-Simon establishes a more general con- 
clusion than the estimate on moments. Their precise theorem is 

THEOREM 18 ([Aizenman, 1982]) Suppose that $D$ is a domain in $\mathbb{R}^n$ of finite
volume and let $\tau$ be the exit time of Brownian motion. Suppose that
$f : [0, \infty) \to \mathbb{R}$ is nonnegative and nondecreasing. Then, if $D^*$ is the ball
centered at the origin with the same volume as $D$, we have

$$E^x[f(\tau_D)] \le E^0[f(\tau_{D^*})].$$

The proof of this result uses a deep result of Brascamp, Lieb, and Luttinger
[Brascamp, 1974] involving symmetric rearrangement of multiple integrals.
The result of Brascamp, Lieb and Luttinger and the theorem of Aizenman and
Simon have been further refined by Burchard and Schmuckenschläger. Using
rearrangement techniques at the level of Brownian paths and a Trotter product
formula, they prove

THEOREM 19 (cf [Burchard, 2002]) Let $M_\kappa$ be a constant curvature space
form, $D \subset M_\kappa$ a Borel set of finite volume, $D^*$ an open disk of volume equal
to that of $D$. Let $\tau_D$ be the exit time of Brownian motion and let $u_D(t, x) =
P^x(\tau_D > t)$. Then, for all $t > 0$, the exit time from $D$ is dominated by the exit
time from $D^*$ in the sense that for every convex increasing function $F$,

$$\int_D F(u_D(t,x))\,dx \le \int_{D^*} F(u_{D^*}(t,x))\,dx \qquad (5.1)$$

where $dx$ is uniform measure. In particular, if $x^*$ is the center of the disk $D^*$,
then

$$\sup_{x \in D} u_D(t,x) \le u_{D^*}(t, x^*). \qquad (5.2)$$

Equality in (5.1) when $F(u_D(t,x))$ is nonconstant or equality in (5.2) occurs
if and only if there is a ball $B$ where $D \setminus B$ has zero volume and $B \setminus D$ is polar.

16.5.2. Comparison and exit time moments 

The structure of Theorem 16 can be abstracted to the following form: given 
a geometric restriction on a Riemannian manifold (ie it satisfies an isoperimet- 
ric condition), the geometry is further constrained (ie there is a lower bound 
on the principal eigenvalue of any domain of a given volume). Such structure 
defines those results which comprise the field of Comparison Geometry. There 
are a number of such comparison results which involve probability. 

We begin with a result of Debiard, Gaveau and Mazet [Debiard, 1976] who 
use path properties of Brownian motion to prove 

THEOREM 20 (cf [Debiard, 1976]) Suppose that $M$ is a Riemannian manifold
with sectional curvatures denoted by $K$. Suppose that $x_0 \in M$ and that
$\rho_0$ is a positive constant that is less than the injectivity radius of $M$ at $x_0$. Let
$B_M(x_0, \rho_0)$ be the geodesic ball of radius $\rho_0$ centered at $x_0$. Let $B_\kappa(x_0', \rho_0)$
be the geodesic ball of radius $\rho_0$ in the constant curvature space form $M_\kappa$ centered
at some origin $x_0'$. Let $p_B(t, x_0, x)$ be the heat kernel on $B_M(x_0, \rho_0)$ and
denote by $p_\kappa(t, r)$ the heat kernel on $B_\kappa(x_0', \rho_0)$, where $r$ is the distance from
the origin $x_0'$ to the second variable. Then, $\forall t > 0$, $x \in B_M(x_0, \rho_0)$,

$$K \le \kappa \Rightarrow p_B(t, x_0, x) \le p_\kappa(t, \mathrm{dist}(x_0, x)) \qquad (5.3)$$

$$K \ge \kappa \Rightarrow p_B(t, x_0, x) \ge p_\kappa(t, \mathrm{dist}(x_0, x)). \qquad (5.4)$$

There is a corresponding result for Ricci curvature due to Cheeger-Yau

THEOREM 21 (cf [Cheeger, 1981]) Suppose that $M$ is an $n$-dimensional Riemannian
manifold with Ricci curvatures denoted by Ric. With the notation of
Theorem 20, $\forall t > 0$, $x \in B_M(x_0, \rho_0)$,

$$\mathrm{Ric} \ge (n - 1)\kappa \Rightarrow p_B(t, x_0, x) \ge p_\kappa(t, \mathrm{dist}(x_0, x)). \qquad (5.5)$$

From the heat kernel comparison theorems (Theorem 20 and Theorem 21)
and standard comparison techniques (eg Bishop's volume comparison [Chavel, 
1984]), it is possible to derive a number of comparison results for norms of 
exit time moments. For example, the following is an analog of a well-known 
comparison result of Cheng [Cheng, 1975]: 

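THEOREM 22 Suppose that $M$ is an $n$-dimensional Riemannian manifold
with sectional curvatures denoted by $K$ and Ricci curvatures denoted by Ric.
Let $\tau$ denote the first exit time of Brownian motion. With the notation of Theorem 20,
for all $p$ and all $m$,

$$K \le \kappa \Rightarrow \|E^x(\tau_{B_M}^m)\|_p \le \|E^x(\tau_{B_\kappa}^m)\|_p \qquad (5.6)$$

$$K \ge \kappa \Rightarrow \|E^x(\tau_{B_M}^m)\|_p \ge \|E^x(\tau_{B_\kappa}^m)\|_p \qquad (5.7)$$

$$\mathrm{Ric} \ge (n-1)\kappa \Rightarrow \|E^x(\tau_{B_M}^m)\|_p \ge \|E^x(\tau_{B_\kappa}^m)\|_p \qquad (5.8)$$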

16.5.3. Comparison and transience/recurrence 

In addition to the above results concerning the relationship of exit time to 
isoperimetric phenomena and comparison geometry, there is a deep and beauti- 
ful connection between isoperimetric and comparison phenomena for noncom- 
pact Riemannian manifolds on the one hand and the transience or recurrence 
of Brownian motion on the other. There is an excellent recent survey of this 
material [Grigoryan, 1999] and we remark that the deep work of Varopoulos 
has been of fundamental importance, especially in the context of groups (cf 
[Varopoulos, 1992] and references therein). We present a few of the more 
striking results. Let M be a complete non-compact Riemannian manifold 
and let Xt denote Brownian motion on M. Recall, 

DEFINITION 23 Brownian motion on $M$ is transient if for some open set $U$
and some point $x$, Brownian motion eventually leaves $U$ with positive probability:

$$P^x\{\exists T : \forall t > T,\ X_t \notin U\} > 0.$$

It is a classical result that Brownian motion in $\mathbb{R}^n$ is recurrent for $n \le 2$ and
transient for $n \ge 3$. Straightforward comparison results allow one to extend
this to spaces with variable curvature: In dimension 2 all nonnegatively curved
manifolds have recurrent Brownian motion while in dimension 3 and above, all
nonpositively curved manifolds have transient Brownian motion (cf [Kendall,
1987] and references therein).

It is a result of classical potential theory (cf [Doob, 1983]) that transience
of Brownian motion is equivalent to M being non-parabolic: 

DEFINITION 24 We say that a complete manifold M is non-parabolic if M 
admits a non-constant positive superharmonic function. Otherwise, we say 
that M is parabolic. 

A recent typical result tying transience of Brownian motion to the geom- 
etry of a non-compact manifold involves establishing sufficiency conditions 
for parabolicity in terms of volume growth (for a survey containing results on 
volume growth and geometry, see [Li, 2000]): 

THEOREM 25 ([Grigoryan, 1999], [Karp, 1982], [Varopoulos, 1983]) Suppose
that $M$ is complete and that $x \in M$. Let $B(x, \rho)$ be
the ball of radius $\rho$ centered at $x$. Suppose that

$$\int^{\infty} \frac{\rho}{\mathrm{Vol}(B(x,\rho))}\,d\rho = \infty. \qquad (5.9)$$

Then $M$ is parabolic.
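
To see how a volume growth condition of this kind separates low from high dimensions, the sketch below evaluates the truncated integral $\int_1^R \rho / \mathrm{Vol}(B(x,\rho))\,d\rho$ in Euclidean space $\mathbb{R}^n$, assuming the criterion is the divergence of that integral as $R \to \infty$. The lower limit 1, the midpoint rule and the function names are choices made for this illustration only.

```python
import math

def euclidean_ball_volume(n, rho):
    """Volume of a ball of radius rho in R^n."""
    return math.pi ** (n / 2.0) / math.gamma(n / 2.0 + 1.0) * rho ** n

def volume_growth_integral(n, upper, lower=1.0, n_steps=20000):
    """Midpoint rule for the integral of rho / Vol(B(x, rho)) over [lower, upper],
    computed in the variable s = log(rho) so that large ranges are handled evenly."""
    s_lo, s_hi = math.log(lower), math.log(upper)
    ds = (s_hi - s_lo) / n_steps
    total = 0.0
    for i in range(n_steps):
        rho = math.exp(s_lo + (i + 0.5) * ds)
        # rho d(rho) = rho^2 ds
        total += rho * rho / euclidean_ball_volume(n, rho) * ds
    return total

if __name__ == "__main__":
    for n in (2, 3):
        print(n, [volume_growth_integral(n, upper) for upper in (1e3, 1e6, 1e9)])
```

In dimension 2 the truncated integral grows like $\log R / \pi$ without bound, consistent with recurrence in the plane. In dimension 3 it stays below $3/(4\pi)$. A volume growth condition of this type is only a sufficient condition for parabolicity, so the convergent case is consistent with transience in $\mathbb{R}^3$ but does not prove it.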



Similar results hold for manifolds which admit a Faber-Krahn type inequality
with isoperimetric function $\Lambda$:

DEFINITION 26 Suppose that $\Lambda : (0, \infty) \to \mathbb{R}$ is a positive decreasing function.
We say that a complete manifold $M$ satisfies a Faber-Krahn type inequality
with isoperimetric function $\Lambda$ if for all precompact $\Omega \subset M$,

$$\mathrm{Area}(\partial\Omega) \ge \Lambda(\mathrm{Vol}(\Omega)). \qquad (5.10)$$

The following is a theorem of Grigoryan [Grigoryan, 1994]: 

THEOREM 27 ([Grigoryan, 1994]) Suppose that $M$ is complete and that for
all precompact open sets of large enough volume, $M$ satisfies a Faber-Krahn
type inequality with isoperimetric function $\Lambda$ satisfying

$$\int^{\infty} \frac{dv}{v^2 \Lambda(v)} < \infty. \qquad (5.11)$$

Then $M$ is non-parabolic.

One can also estimate the heat kernel [Grigoryan, 1994]: 

THEOREM 28 ([Grigoryan, 1994]) Suppose that $M$ is complete and that $M$
satisfies a Faber-Krahn type inequality with isoperimetric function $\Lambda$. Fix $x \in
M$, $t_0 > 0$, $\delta \in (0, 1)$ and suppose that there exists a non-negative function
$\Phi : [t_0, \infty) \to \mathbb{R}$ such that

$$\Phi(t_0) \le \frac{2}{\delta\, p_M(2t_0, x, x)}, \qquad \int_{\Phi(t_0)}^{\Phi(t)} \frac{dv}{v \Lambda(v)} = (1 - \delta)(t - t_0). \qquad (5.12)$$

Then for all $t > t_0$,

$$p_M(2t, x, x) \le \frac{2}{\delta\, \Phi(t)}. \qquad (5.13)$$



These results follow via estimates for capacity and can be further refined 
[Grigoryan, 1999A].

Estimates for the long time behavior of the heat kernel on a complete Rie- 
mannian manifold can often be parlayed into information concerning the ge- 
ometry of the manifold at infinity (to make this precise, see section 6.2 below). 
There are a number of excellent surveys of this theme available (cf [Grigoryan, 
1999B]). That we consider a single recent result should in no way be taken to 
represent activity in the field; the associated literature is voluminous.

Intuitively, given a ball $B(x,r)$ of radius $r$ centered at $x \in M$, one expects
that the faster the volume Vol(B(x,r)) grows as a function of the radius, the 
faster the heat kernel p(t,x,x) should decay. In fact, it is possible to give 
a bound for the decay of the heat kernel in terms of volume growth. More 
precisely, Barlow, Coulhon and Grigoryan prove [Barlow, 2001] 

THEOREM 29 Let $M$ be a geodesically complete noncompact Riemannian
manifold with bounded geometry and let $r_0 > 0$ be its injectivity radius. Suppose
that for all points $x \in M$ and all $r \ge r_0$,

$$\mathrm{Vol}(B(x, r)) \ge v(r)$$

where $v : [r_0, \infty) \to \mathbb{R}_+$ is a continuous positive strictly increasing function.
Then, for all $t \ge t_0 = r_0^2$,

$$\sup_{x \in M} p(t, x, x) \le \frac{C}{\gamma(ct)} \qquad (5.14)$$

where $\gamma$ is defined by

$$t - t_0 = \int_{v(r_0)}^{\gamma(t)} v^{-1}(s)\,ds,$$

$v^{-1}$ is the inverse function, and $c$, $C$ are positive constants.

To prove the theorem, the authors first note that their volume growth hypothesis
implies a Faber-Krahn type inequality

$$\lambda_1(D) \ge \Lambda(\mathrm{Vol}(D)) \qquad (5.15)$$

where $D$ is a large enough precompact set, $\lambda_1$ is the principal Dirichlet eigenvalue
and $\Lambda$ is a function determined by $v(r)$. They then establish that the
inequality (5.15) is equivalent to the required heat kernel estimates.
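
The implicit definition of the decay rate is easier to read on an example. The sketch below assumes that $\gamma$ is determined by $t - t_0 = \int_{v(r_0)}^{\gamma(t)} v^{-1}(s)\,ds$ and solves this equation numerically (Simpson's rule plus bisection) for a user-supplied growth function. It is then checked against the closed form available for polynomial growth $v(r) = r^n$. The function names and numerical parameters are hypothetical.

```python
import math

def integral_of_inverse(v_inv, a, b, n_intervals=2000):
    """Composite Simpson rule for the integral of v_inv over [a, b] (n_intervals even)."""
    if b <= a:
        return 0.0
    h = (b - a) / n_intervals
    total = v_inv(a) + v_inv(b)
    for i in range(1, n_intervals):
        total += (4.0 if i % 2 else 2.0) * v_inv(a + i * h)
    return total * h / 3.0

def gamma_of_t(t, v, v_inv, r0, t0):
    """Solve t - t0 = integral from v(r0) to gamma of v_inv(s) ds for gamma."""
    target = t - t0
    base = v(r0)
    lo, hi = base, 2.0 * base + 1.0
    while integral_of_inverse(v_inv, base, hi) < target:   # bracket the root
        hi *= 2.0
    for _ in range(100):                                    # bisection
        mid = 0.5 * (lo + hi)
        if integral_of_inverse(v_inv, base, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    n = 3
    for t in (11.0, 101.0, 1001.0):
        g = gamma_of_t(t, lambda r: r ** n, lambda s: s ** (1.0 / n), 1.0, 1.0)
        print(t, g, t ** (n / (n + 1.0)))
```

For $v(r) = r^n$ with $r_0 = t_0 = 1$ the equation integrates to $\gamma(t) = (1 + \frac{n+1}{n}(t - 1))^{n/(n+1)}$, so a bound of the form $C/\gamma(ct)$ decays like $t^{-n/(n+1)}$.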

16.6 Minimal Varieties 

For the purpose of this section, minimal varieties are geometric objects 
which arise as solutions to geometric variational problems. In this section, 
we review minimal varieties with ties to probability. 

We begin with the prototypical example given by the St. Venant Torsion
Problem (Theorem 9). In this case the minimal varieties are domains which
maximize the $L^1$-norm of the first exit
time moment of Brownian motion, given a volume constraint. In the category
of smooth domains, that maximizers must be balls follows from the
work of Serrin [Serrin, 1971]. In more detail, suppose we consider the collection
of all smoothly bounded domains $D \subset \mathbb{R}^2$ with compact closure:

$$\mathcal{D} = \{D \subset \mathbb{R}^2 : \bar{D} \text{ compact},\ \partial D \text{ smooth}\}. \qquad (6.1)$$

This space has a natural smooth structure with the tangent space at each $D \in \mathcal{D}$
identified with smooth functions on $\partial D$: $T_D\mathcal{D} \simeq C^\infty(\partial D)$. Consider the
smooth function $F : \mathcal{D} \to \mathbb{R}$ defined by

$$F(D) = \|E^x[\tau_D]\|_1. \qquad (6.2)$$

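Smoothly perturbing $D$, we obtain a characterization of critical points of $F$: $D$
is critical for $F$ if one can solve the overdetermined boundary value problem:

$$
\begin{aligned}
\Delta u + 1 &= 0 && \text{on } D \\
u &= 0 && \text{on } \partial D \\
\frac{\partial u}{\partial n} &= c && \text{on } \partial D
\end{aligned} \qquad (6.3)
$$

where $\frac{\partial}{\partial n}$ is the normal derivative along the boundary and $c = -\frac{\mathrm{Vol}(D)}{\mathrm{Area}(\partial D)}$ is a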
constant. Serrin's result states that it is possible to solve the overdetermined 
boundary value problem (6.3) if and only if the domain is a ball. As pointed 
out by Serrin, his result holds when one replaces the Laplace operator by the 
Laplace operator with certain types of lower order nonlinearity. Serrin's result, 
as well as his technique, led to a great deal of progress in nonlinear PDE (cf 
[Gidas, 1979], [Gidas, 1981], [Berestycki, 1991], [Berestycki, 1993]). 
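
As a quick consistency check on the overdetermined problem, one can verify directly that a ball does solve it. The computation below is a sketch for the ball $B_R \subset \mathbb{R}^n$ of radius $R$ centered at the origin (with $n = 2$ the case considered above), taking the problem to be $\Delta u + 1 = 0$ in the domain, $u = 0$ and $\partial u/\partial n = c$ on the boundary, with $\partial/\partial n$ the outward normal derivative:

```latex
\begin{align*}
u(x) &= \frac{R^2 - |x|^2}{2n}, &
\Delta u &= -\frac{\Delta |x|^2}{2n} = -\frac{2n}{2n} = -1, \\
u\big|_{|x| = R} &= 0, &
\frac{\partial u}{\partial n}\bigg|_{|x| = R} &= -\frac{R}{n}
  = -\frac{\omega_n R^n}{n\,\omega_n R^{n-1}}
  = -\frac{\mathrm{Vol}(B_R)}{\mathrm{Area}(\partial B_R)}.
\end{align*}
```

Here $\omega_n$ denotes the volume of the unit ball. The value of the constant is forced in general by the divergence theorem: $-\mathrm{Vol}(D) = \int_D \Delta u = \int_{\partial D} \partial u/\partial n = c\,\mathrm{Area}(\partial D)$.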

If one considers in place of $\mathbb{R}^2$ a constant curvature space form and in place
of $F$ the $L^p$-norm of the $k$th moment of the exit time, one can run the same
variational argument, obtaining a characterization of critical points by overdetermined
boundary value problems (with nonlinearity in the boundary condition
as opposed to the operator). It turns out that the boundary value problems have
solutions if and only if the domain is a ball (cf [McDonald, 2002]), thus characterizing
the minimal varieties for the $L^p$-norm of the exit time moments for
smooth domains in constant curvature space forms. For Borel sets, the case of
equality is settled by Burchard and Schmuckenschläger [Burchard, 2002] (cf
Theorem 19).

In addition to controlling volume there are a number of other geometric con- 
straints which one can impose on domains in an ambient space when studying 
$L^p$-norms of exit time moments of Brownian motion. One such constraint, important
in a number of applications, involves fixing the inradius of a domain.
We recall the definition: 

Definition 30 Suppose that $M$ is Riemannian and that $D \subset M$. Then the inradius
of $D$ is the extended real number

$$\sup\{r \in \mathbb{R} : \exists x \in D,\ B(x, r) \subset D\}$$

where $B(x, r)$ is the ball of radius $r$ centered at $x$.

Using conformal techniques, the following result is contained in the work of
Bañuelos, Carroll, and Housworth [Banuelos, 1998]:

Theorem 31 Suppose $D \subset \mathbb{C}$ and let $\tau$ be the first exit time of Brownian
motion. Then

$$\mathrm{inradius}(D) = 1 \Rightarrow \|E^x(\tau_D)\|_\infty \le \|E^x(\tau_S)\|_\infty$$

where $S$ is the infinite rectangular strip $S = \{(x,y) : -1 < y < 1\}$.

There are a variety of related recent results for unbounded domains.

16.7 Harmonic Functions 

Let $M$ be a complete Riemannian manifold, $\Delta$ the Laplace operator acting
on functions on $M$. Recall, a function $f : M \to \mathbb{R}$ is harmonic if it satisfies

$$\Delta f = 0.$$

Equivalently, $f$ is harmonic if and only if $f$ is stationary for the Dirichlet form

$$E(f) = \int_M |\nabla f|^2\,dg. \qquad (7.1)$$

It is well known that there is a deep relationship between Brownian motion
and harmonic functions. An example of this relationship is given by Kakutani's
representation of the solution to the Dirichlet problem (2.6) in terms of
Brownian motion (2.7). In this section we survey such probabilistic
representations and their connection to the geometry of noncompact, complete
manifolds.

The representation (2.7) is but one instance of an extensive body of work
devoted to the representation of harmonic functions via boundary geometry.
Another such representation of harmonic functions is given by the Poisson kernel.
More precisely, suppose for concreteness that $D$ is a smoothly bounded
domain with compact closure in $\mathbb{R}^n$. Let $G(x, y)$ be the Green's function for
$D$, $dy$ surface measure on $\partial D$. Define a function $u$ on $D$ by

$$u(x) = \int_{\partial D} k_y(x)\,d\mu(y) \qquad (7.2)$$

where $k_y(x) = -\frac{\partial G}{\partial n_y}(x, y)$ is the Poisson kernel and $d\mu(y) = f(y)\,dy$ for
some positive function $f$ on the boundary of $D$. Then (7.2) defines a positive
harmonic function: the solution of the Dirichlet problem with boundary data
$f$. In fact, allowing the measure $d\mu$ to be supported and finite on $\partial D$ (with no
other constraints) provides a representation of every positive harmonic function
on $D$.
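
For the unit disk in the plane the Poisson kernel is explicit, $k_y(x) = (1 - |x|^2) / (2\pi |x - y|^2)$ for $|y| = 1$, and the representation can be tested numerically. The sketch below identifies $\mathbb{R}^2$ with the complex plane and integrates positive boundary data against this kernel with the periodic trapezoid rule. The choice of domain, the boundary data and the function names are assumptions made for the illustration.

```python
import cmath
import math

def poisson_integral(x, boundary_data, n_nodes=2000):
    """Integrate boundary_data(theta) against the Poisson kernel of the unit disk.

    x is a complex number with |x| < 1; boundary points are y = exp(i * theta).
    """
    total = 0.0
    for j in range(n_nodes):
        theta = 2.0 * math.pi * j / n_nodes
        y = cmath.exp(1j * theta)
        kernel = (1.0 - abs(x) ** 2) / (2.0 * math.pi * abs(x - y) ** 2)
        total += kernel * boundary_data(theta)
    return total * (2.0 * math.pi / n_nodes)   # arc-length weight of each node

if __name__ == "__main__":
    data = lambda theta: 2.0 + math.cos(theta)  # positive boundary data
    x = 0.3 + 0.4j
    # The harmonic extension of 2 + cos(theta) is 2 + Re(x).
    print(poisson_integral(x, data), 2.0 + x.real)
```

The computed value matches the harmonic function $2 + \mathrm{Re}\,x$, which has the prescribed boundary values and is positive in the disk.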

It was an idea of Martin [Martin, 1941] that such a representation should be
possible for bounded but otherwise arbitrary domains in $\mathbb{R}^n$, given the appropriate
definition of "boundary." This idea came to play an important role in potential
theory, both from a probabilistic and from an analytic point of view. The
material was developed by both schools (cf [Dynkin, 1965], [Dynkin, 1982],
[Doob, 1983], [Pinsky, 1995]).

16.7.1. Martin boundaries 

To define the Martin boundary, let $D \subset \mathbb{R}^n$ be an arbitrary bounded domain
and fix $p \in D$. Let $G$ be the minimal Green's function for $D$ and define

$$h_y(x) = \frac{G(x,y)}{G(p,y)}. \qquad (7.3)$$

Let $\{y_i\}$ be a nonconvergent sequence of points in $D$ and consider the harmonic
functions $h_i(x) = h_{y_i}(x)$. Then the sequence $\{h_i\}$ is uniformly bounded
on compact subsets of $D$ and for all $i$, $h_i(p) = 1$. By Harnack's inequality,
there exists a convergent subsequence, denoted $\{h_{i_j}\}$, which converges uniformly
on compact subsets of $D$ to a positive harmonic function $h(x)$. We
call the sequence of points $\{y_i\}$ a Martin sequence. We say that two Martin
sequences are equivalent if and only if they have the same limiting harmonic
functions.
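
The construction can be followed explicitly on the unit disk, where the Green's function is known in closed form. The sketch below takes $p = 0$, assumes the normalization $h_y(x) = G(x,y)/G(p,y)$, and evaluates the ratio along the sequence $y_i = (1 - 1/i)\,y_\infty$ approaching a boundary point $y_\infty$. The limit is compared with $(1 - |x|^2)/|x - y_\infty|^2$, which is the Poisson kernel of the disk up to the factor $2\pi$. The disk, the sequence and the function names are choices made for this illustration.

```python
import math

def green_unit_disk(x, y):
    """Green's function of the unit disk for complex points x != y, |x|, |y| < 1."""
    return math.log(abs(1.0 - x * y.conjugate()) / abs(x - y)) / (2.0 * math.pi)

def martin_ratio(x, y, p=0j):
    """h_y(x) = G(x, y) / G(p, y), normalized so that h_y(p) = 1."""
    return green_unit_disk(x, y) / green_unit_disk(p, y)

def limiting_kernel(x, boundary_point):
    """Expected limit of h_y(x) as y tends to a point of the unit circle."""
    return (1.0 - abs(x) ** 2) / abs(x - boundary_point) ** 2

if __name__ == "__main__":
    x, y_inf = 0.3 + 0.4j, 1.0 + 0j
    for i in (10, 1000, 100000):
        y_i = (1.0 - 1.0 / i) * y_inf
        print(i, martin_ratio(x, y_i), limiting_kernel(x, y_inf))
```

Every sequence converging to the same boundary point gives the same limit here, so in this smooth example the equivalence classes of Martin sequences correspond to points of the unit circle.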

Definition 32 The Martin boundary of $D$, denoted $\mathcal{M}$, is the collection of
equivalence classes of Martin sequences. We say that a point $[h] \in \mathcal{M}$ is
minimal if the corresponding harmonic limit $h$ satisfies

If $h'$ is a positive harmonic function on $D$ and $h' \le h$, then $h' = ch$ for some
$c \in (0, 1]$.

The minimal Martin boundary of $D$, denoted $\mathcal{M}_0$, is the collection of all minimal
points.

The results of [Martin, 1941] contain the Martin representation theorem: 

Theorem 33 Suppose $D \subset \mathbb{R}^n$ is bounded and that $\mathcal{M}_0$ is the minimal Martin
boundary of $D$. For $y \in \mathcal{M}_0$, let $k_y(x)$ denote the corresponding positive
harmonic function. Then for each positive harmonic function $u$ there exists a
unique finite measure $\mu_u$ supported on $\mathcal{M}_0$ such that

$$u(x) = \int_{\mathcal{M}_0} k_y(x)\,d\mu_u(y). \qquad (7.4)$$

Conversely, for every finite measure supported on $\mathcal{M}_0$, (7.4) defines a positive
harmonic function on $D$.

In providing the above representation theorem, the Martin boundary provides
a means of employing analytic techniques to study the geometry of the underlying
domain. To see that this is the case, note that there is a natural metric
topology on $D \cup \mathcal{M}$ (cf [Pinsky, 1995]) for which $D \cup \mathcal{M}$ becomes a compactification
of $D$. When $D$ is sufficiently regular, this coincides with the Euclidean
compactification of $D$. We have the following theorem of Hunt-Wheeden:

Theorem 34 (cf [Hunt, 1970]) Suppose $D \subset \mathbb{R}^n$. Suppose that for each
$y \in \partial D$, there is a ball, $B(y)$, centered at $y$ such that $B(y) \cap \partial D$ is the graph
of a Lipschitz function. Then the Martin boundary of $D$, the minimal Martin
boundary of $D$ and the Euclidean boundary of $D$ all coincide.

This result strongly suggests that the ideas surrounding the notion of a Mar- 
tin boundary might be useful in the study of the geometry of complete non- 
compact Riemannian manifolds near their "boundary." Obviously, the first 
step in such a program is to establish precisely what is meant by "geometry of 
the boundary" in this context. There is a natural geometric approach: 

Definition 35 Let M be a complete Riemannian manifold. Given two
geodesic rays, $\gamma_1$ and $\gamma_2$, in M we say that $\gamma_1(t)$ and $\gamma_2(t)$ are asymptotic
if $\mathrm{dist}(\gamma_1(t), \gamma_2(t))$ is a bounded function of $t$.

It is clear that the notion of asymptotic defines an equivalence relation on 
the collection of geodesic rays. 

Definition 36 Let M be a complete Riemannian manifold. We define the
sphere at infinity, denoted $S(\infty)$, as the collection of equivalence classes of
geodesic rays in M.

There is a natural topology on $M \cup S(\infty)$ (the cone topology) and, with
respect to this topology, $S(\infty)$ gives a topological compactification of M. Given
this, we refer to $S(\infty)$ as the geometric boundary of M.

To define the Martin boundary of a (class of) complete Riemannian mani- 
folds, we model the development on Definition 32 and its motivating discus- 
sion: 

Definition 37 Suppose that M is a complete Riemannian manifold admitting
a Green's function, $G(x, y)$. Let $p \in M$ and, for $x, y \in M$, let $h_y(x)$ be
defined by (7.3). Let $y_i$ be a nonconvergent sequence of points, $h_{y_i}(x)$ the
corresponding harmonic functions, and $h_{y_{i_j}}$ a subsequence converging uniformly
on compacts to a harmonic limit $h(x)$. We call the sequence $y_i$ a Martin
sequence and we say that two Martin sequences are equivalent if they have the
same harmonic limit. The Martin boundary of M is the collection of
equivalence classes of Martin sequences.

The question of which non-compact Riemannian manifolds should be stud- 
ied via Martin's approach was clarified by the seminal work of Yau [Yau, 
1975], [Yau, 1976]. Before proceeding, we need a fundamental definition: 

Definition 38 A manifold is said to have the Liouville property if it does 
not admit any nonconstant bounded harmonic functions. A manifold is said to 
have the strong Liouville property if it does not admit any nonconstant positive 
harmonic functions. 

Yau proved 

Theorem 39 ([Yau, 1975]) If M has nonnegative Ricci curvature, then M 
has the strong Liouville property. 

As the Martin boundary construction requires a rich structure of positive har- 
monic functions, Yau's result suggests that if Martin boundaries are to play 
a role in the study of the geometry of a non-compact Riemannian manifold, 
negative curvature will be necessary (cf also [Dynkin, 1965]). Given that the 
definition of the Martin boundary involves the existence of a Green's function, 
we must further restrict to a class of manifolds admitting Green's functions;
for example, manifolds with pinched negative curvature. This is the setting of 
the work of Anderson-Schoen [Anderson, 1985] who proved 

Theorem 40 (cf [Anderson, 1985]) Let M be a complete simply connected
manifold with sectional curvature $K_M$ satisfying

$$-b^2 \le K_M \le -a^2 < 0.$$

Then there is a natural homeomorphism between the Martin boundary of M
and the geometric boundary of M (the sphere at infinity).

Since the publication of [Anderson, 1985], there has been an explosion in the 
study of harmonic functions on complete Riemannian manifolds, their cor- 
responding Martin boundaries, and the geometry of such manifolds at infin- 
ity. There are a number of informative surveys available (cf [Li, 2000]), most 
focussing on the geometric/function theoretic aspects of the material. There 
has also been a roughly concurrent probabilistic development of the mate- 
rial (a survey can be found in [Pinsky, 1995]), with results largely parallel- 
ing those obtained function theoretically (cf [Doeblin, 1938], [Dynkin, 1965], 
[Kifer, 1992], [Hsu, 1985], [Cranston, 1993], [Grigoryan, 1999] and references
therein). Many such results can be inferred in the context of volume compar- 
ison and potential theory (cf section 5.3 above). Reference [Grigoryan, 1999] 
contains an excellent review of this material. We focus our remarks on material 
of independent interest. 

The probabilistic approach to Martin boundaries involves the study of the 
asymptotic behavior of Brownian motion and the existence, given appropriate 
assumptions on the ambient manifold, of almost sure limiting directions. Be- 
cause the probabilistic approach does not require the existence of a uniquely 
defined Laplace operator, it is possible to formulate a theory of Martin bound- 
aries for spaces which are not manifolds, for example simplicial complexes 
whose simplices are Euclidean (ie Euclidean complexes). Such a program has 
recently been carried out in part by Brin and Kifer, who prove the appropriate 
analog of the Anderson-Schoen result for Euclidean complexes [Brin, 2001]. 
This development provides a framework for a geometric function theory for 
large classes of singular spaces. 

In a similar vein, given that the probabilistic development of Martin
boundaries involves specific path properties of an underlying process, one 
might choose to fix the ambient manifold and investigate analogs of the Martin 
constructions for processes other than Brownian motion. This has recently 
been carried out by Chen and Song, who investigate the appropriate analogs of 
Martin boundary for symmetric stable processes [Chen, 1998]. 

16.7.2. Harmonic maps 

Suppose that M and N are Riemannian manifolds and $f : M \to N$. We say
that $f$ is a harmonic map if $f$ is a stationary point of the Dirichlet form

$$E(f) = \int_M \|Df\|^2\, dg \qquad (7.5)$$

where $Df$ is the derivative of $f$ (the induced map between tangent spaces).
Given the relationship between Brownian motion and harmonic functions, it is
natural to expect that probability will play an interesting role in the theory of
harmonic maps (this seems to have been first suggested by Eells and Lemaire
[Eells, 1983]). This is indeed the case; an interesting survey of recent
developments can be found in [Kendall, 1998].

16.8 Hodge Theory 

Let M be an n-dimensional differentiable manifold, $\Omega^k = \Omega^k(M)$ the
bundle of smooth k-forms on M, $d : \Omega^k \to \Omega^{k+1}$ the exterior derivative (see
section 2). The kth de Rham cohomology of M is the quotient space of closed
k-forms by exact k-forms:

$$H^k_{dR}(M) = \frac{\{\omega \in \Omega^k : d\omega = 0\}}{\{\omega \in \Omega^k : \exists\, \alpha \in \Omega^{k-1},\ d\alpha = \omega\}}. \qquad (8.1)$$

The celebrated work of de Rham provides an isomorphism between $H^k_{dR}(M)$
and the kth Čech cohomology group of M with real coefficients. Thus, the
spaces $H^k_{dR}(M)$, $0 \le k \le n$, are topological invariants of M. In this section
we survey probabilistic results which provide a means of studying $H^k_{dR}(M)$
under appropriate conditions on M. These results revolve around heat flow
and the work of Hodge.

Suppose that M is a compact Riemannian manifold and let $\Delta$ be the
Laplace-Beltrami operator acting on k-forms on M. Let $L^2(\Omega^k, dg)$ be the $L^2$
completion of the sections of the k-form bundle with respect to the induced volume
$dg$. As discussed in section 2 and section 4, the Laplace-Beltrami operator is
essentially self-adjoint and thus admits a unique self-adjoint extension to an
operator on $L^2$-sections of the k-form bundle. It is elliptic, and thus its kernel
consists of smooth k-forms which we suggestively denote by

$$H^k_{Hodge}(M) = \{\omega \in \Omega^k : \Delta\omega = 0\}. \qquad (8.2)$$

Let $P_t = e^{-t\Delta}$ be the heat operator acting on k-forms. Let $\omega \in \Omega^k$. Then the
solution of the Cauchy initial value problem

$$\partial_t \omega_t + \Delta\omega_t = 0 \ \text{ on } (0, \infty) \times M, \qquad \omega_0 = \omega \qquad (8.3)$$

is given by

$$\omega_t = P_t\omega_0. \qquad (8.4)$$

The operator $P_t$ is compact for all $t > 0$ and admits a unique self-adjoint
extension to $L^2(\Omega^k, dg)$. In fact, $P_t$ is a contraction for all $t > 0$ which converges
in norm to orthogonal projection on $H^k_{Hodge}(M)$ as $t \to \infty$.

Suppose $\omega_0 \in \Omega^k$ and write $\omega_0 = (I - P_t)\omega_0 + P_t\omega_0$. Then

$$(I - P_t)\omega_0 = -\int_0^t \partial_s(P_s\omega_0)\, ds = \Delta \int_0^t P_s\omega_0\, ds = d\alpha + d^*\beta \qquad (8.5)$$

where

$$\alpha = d^* \int_0^t P_s\omega_0\, ds \in \Omega^{k-1} \qquad (8.6)$$

$$\beta = d \int_0^t P_s\omega_0\, ds \in \Omega^{k+1}. \qquad (8.7)$$

Taking a limit, we obtain the celebrated Hodge decomposition:

$$\omega_0 = d\alpha + d^*\beta + \gamma \qquad (8.8)$$

where $\alpha \in \Omega^{k-1}$, $\beta \in \Omega^{k+1}$ and $\gamma \in H^k_{Hodge}(M)$, the decomposition being
orthogonal. Given $[\omega] \in H^k_{dR}(M)$, let $\omega_0$ represent $[\omega]$ and write $\omega_0$ as in
(8.8). Then since $\omega_0$ is closed, $\beta = 0$. Moreover, if $\tilde\omega = d\tilde\alpha + \tilde\gamma$ is another
representation of $[\omega]$, then $\tilde\gamma = \gamma$. We conclude that the evolution $\omega_t = P_t\omega_0$
smoothly deforms every representative of the class $[\omega]$ to its harmonic
projection $\gamma$, which is the element of minimal norm representing $[\omega]$ as an element
of $H^k_{dR}(M)$. Thus, we obtain the celebrated result of Hodge:

Theorem 41 If M is a compact Riemannian manifold, there is a natural
isomorphism between the de Rham cohomology of M and the harmonic forms
of M:

$$H^k_{dR}(M) \simeq H^k_{Hodge}(M). \qquad (8.9)$$

The isomorphism is given by identifying each de Rham class with its
representative of minimal norm.
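
A finite-dimensional toy analog may make the heat-flow argument concrete. The sketch below is not taken from the text: it assumes a cycle graph with n oriented edges in place of M, treats functions on edges as discrete 1-forms, and replaces $P_t$ by explicit Euler steps of $\partial_t\omega = -\Delta\omega$. The names `laplacian_1` and `heat_flow` are hypothetical. On this graph the harmonic 1-forms are the constants, so the flow should drive any edge function to its average.

```python
# Toy analog of the heat-flow proof of the Hodge theorem (an illustrative
# assumption, not the construction in the text): discrete 1-forms on a
# cycle graph with edges e_i = (i, i+1 mod n).


def laplacian_1(omega):
    """Apply Delta = d d* to an edge function on the cycle graph.

    The graph has no 2-cells, so the d* d term vanishes and
    (Delta omega)(e_i) = 2*omega_i - omega_{i-1} - omega_{i+1}.
    """
    n = len(omega)
    return [2 * omega[i] - omega[i - 1] - omega[(i + 1) % n] for i in range(n)]


def heat_flow(omega0, step=0.2, n_steps=400):
    """Approximate P_t omega0 by explicit Euler steps of d/dt omega = -Delta omega.

    The step must stay below 0.5 for stability, because the eigenvalues of
    Delta on the cycle graph lie in [0, 4].
    """
    omega = list(omega0)
    for _ in range(n_steps):
        lap = laplacian_1(omega)
        omega = [w - step * l for w, l in zip(omega, lap)]
    return omega


def squared_norm(omega):
    """Squared l2 norm of an edge function."""
    return sum(w * w for w in omega)


omega0 = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0]   # an arbitrary discrete 1-form
gamma = heat_flow(omega0)                    # long-time limit of the flow
mean = sum(omega0) / len(omega0)             # harmonic projection of omega0

print("limit of the flow:", [round(g, 6) for g in gamma])
print("average of omega0:", round(mean, 6))
print("squared norms:", squared_norm(omega0), ">=", round(squared_norm(gamma), 6))
```

In this toy setting the difference between `omega0` and the limit is a discrete exact form, which plays the role of the term $d\alpha$ in the Hodge decomposition; there is no counterpart of $d^*\beta$ because the graph carries no 2-forms.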

When M is not compact one cannot expect the operators $P_t$ to be compact
and the above approach must be modified if it is to have any hope of producing
an analog of the Hodge theorem. To see how one might go about constructing
an analog, consider the case of $\mathbb{R}^n$. The de Rham cohomology of $\mathbb{R}^n$ is well
known:

$$H^k_{dR}(\mathbb{R}^n) = \begin{cases} \mathbb{R} & \text{if } k = 0 \\ 0 & \text{if } k \neq 0 \end{cases}$$

the zero dimensional cohomology being represented by constant functions
which are not $L^2$ with respect to the measure induced by the volume form (i.e.
Lebesgue measure). To produce a reasonable candidate for the Hodge
Laplacian, we rescale the volume element appropriately: Fix $t > 0$ and $x_0 \in \mathbb{R}^n$ and
let $p_t(x_0, x)$ be the fundamental solution of the heat equation on $\mathbb{R}^n$ at time $t$:

$$p_t(x_0, x) = \frac{1}{(4\pi t)^{n/2}}\, e^{-\frac{|x - x_0|^2}{4t}}.$$

Consider the heat kernel weighted measure

$$d\mu = p_t(x_0, x)\, dx$$

and let $L^2(\mathbb{R}^n, d\mu)$ be the corresponding weighted $L^2$-space. The measure $d\mu$
induces an adjoint of the exterior derivative, denoted $d^*_\mu$, and a corresponding
Laplace-Beltrami operator $\Delta_\mu$ acting on the appropriately weighted $L^2$-form
bundles. Writing

$$H^k_{\mu\text{-Hodge}} = \mathrm{kernel}(\Delta_\mu)$$
one can compute directly [Bueler, 1999] that 

In the context of Hodge theorems for finite dimensional Riemannian 
manifolds these ideas seem to have been introduced by Bueler [Bueler, 
1999] and further developed by Ahmed-Stroock [Ahmed, 2000]. We sketch 
the results of the latter. 

Suppose that M is a complete, oriented connected Riemannian manifold
with Ricci curvature bounded below and the Riemann curvature operator
bounded above. Suppose that $U : M \to [0, \infty)$ is a smooth function
satisfying

1 $U$ has compact level sets.

2 There exist $C < \infty$ and $\theta \in (0, 1)$ such that $\Delta U \le C(1 + U)$ and
$\|\mathrm{grad}\, U\|^2 \le C e^{\theta U}$.

3 There exists an $\epsilon > 0$ such that $\epsilon U^{1+\epsilon} \le 1 + \|\mathrm{grad}\, U\|^2$.

4 There exists a $B < \infty$ such that for all $x \in M$ and all $V \in T_xM$,

$$\langle V, \mathrm{Hess}_U V \rangle \ge -B\|V\|^2$$

where $\mathrm{Hess}_U$ denotes the Hessian of $U$ and the pairing is given by the
metric.

Then (cf [Ahmed, 2000] Lemma 6.2), for each $x \in M$, there is a unique path
$F_\cdot(x) : [0, \infty) \to M$ satisfying $F_0(x) = x$ and $\frac{d}{dt}F_t(x) = -\mathrm{grad}_{F_t(x)} U$.

Moreover, $F_\cdot : [0, \infty) \times M \to M$ is a smooth map which is a diffeomorphism
onto its image with differential a linear map everywhere bounded by $e^{Bt}$. In
particular, given a smooth form $\omega$, the pullback $(F_t)^*\omega$ is bounded and if $\omega$ is
exact so is the pullback. Let $\Phi\omega$ be the orthogonal projection of $(F_t)^*\omega$ onto
the space of $L^2$ $U$-weighted harmonic forms on M. Then (cf [Ahmed, 2000]
Theorem 6.4)

Theorem 42 With M, $U$ and $\Phi$ as above, the map $\Phi$ induces a linear
isomorphism between the $U$-weighted $L^2$ cohomology of M and the de Rham
cohomology of M. The map is natural in the sense that $\Phi\omega$ is the element of
minimal $U$-weighted $L^2$ norm.

In addition to the work of Ahmed-Stroock, recent work of Gong-Wang
[Gong, 2001] involving heat kernel estimates for a class of complete Rieman- 
nian manifolds containing those manifolds with Ricci curvature bounded be- 
low can be used to compute Hodge cohomology for Witten-deformed Lapla- 
cian in the top dimension. 

Finally, we mention the work of Elworthy, Li and Rosenberg on L 2 har- 
monic forms [Elworthy, 1998]. 

Recall, if M is Riemannian, the Weitzenböck decomposition of the
Laplace-Beltrami operator on k-forms expresses the Laplacian in terms of the
Levi-Civita connection and certain curvature invariants (2.23). When M is
complete and the curvature term is positive, it is a theorem of Bochner that the
corresponding cohomology in dimension k vanishes.

In [Elworthy, 1998], the authors consider Riemannian manifolds whose
Weitzenböck curvature term is strongly stochastically positive (when M is
compact, this allows the curvature term to be negative on a set of small
volume). They establish a number of vanishing theorems and a variety of
curvature pinching results; for example, they prove that a compact manifold cannot
admit both a strongly stochastically positive $\mathcal{R}^2$ term and a metric with
pinched negative curvature. Many of their results apply to the Witten
Laplacian. The approach should yield a number of additional results.

Abstract Let $X_1, \ldots, X_n$ be independent and identically distributed (iid) random
variables. We denote the sample mean $\bar X = n^{-1}\sum_{i=1}^n X_i$ and the sample variance
$S^2 = (n-1)^{-1}\sum_{i=1}^n (X_i - \bar X)^2$ for $n \ge 2$. Then, it is well-known that if
the underlying common probability model for the X's is $N(\mu, \sigma^2)$, the sample
mean $\bar X$ and the sample variance $S^2$ are independently distributed. On the
other hand, it is also known that if $\bar X$ and $S^2$ are independently distributed, then
the underlying common probability model for the X's must be normal (Zinger
(1958)). Theorem 1.1 summarizes these. But, what can one expect regarding
the status of independence or dependence between $\bar X$ and $S^2$ when the random
variables X's are allowed to be non-iid or non-normal? In direct contrast
with the message from Theorem 1.1, what we find interesting is that the sample
mean $\bar X$ and the variance $S^2$ may or may not follow independent probability
models when the observations $X_i$'s are not iid or when these follow non-normal
probability laws. With the help of examples, we highlight a number of
interesting scenarios. These examples point toward an opening for the development of
important characterization results and we hope to see some progress on this in
the future. Illustrations are provided where we have applied the t-test based on
Pearson-sample correlation coefficient, a traditional non-parametric test based
on Spearman-rank correlation coefficient, and the Chi-square test to "validate"
independence or dependence between the appropriate $\bar x$, $s$ data. On a number of
occasions, the t-test and the traditional non-parametric test unfortunately arrived
at conflicting conclusions based on the same data. We raise the potential of a major
problem in implementing either a t-test or the nonparametric test as exploratory
data analytic (EDA) tools to examine dependence or association for paired data
in practice! The Chi-square test, however, correctly validated dependence
whenever $(\bar x, s)$ data were dependent. Also, the Chi-square test never sided against
a correct conclusion that the paired data $(\bar x, s)$ were independent whenever the
paired variables were in fact independent. It is safe to say that among three
contenders, the Chi-square test stood out as the most reliable EDA tool in validating
the true state of nature of dependence (or independence) between $\bar X$, $S^2$ as
evidenced by the observed paired data $\bar x$, $s$ whether the observations $X_1, \ldots, X_n$
were assumed iid, not iid or these were non-normal.

Keywords: Frequency histogram, tests for independence, P-value, Chi-square test, Pearson- 
sample correlation test, t-test, Spearman-rank correlation test, nonparametric 
test. 



17.1 Introduction



Let us suppose that $X_1, \ldots, X_n$ are independent and identically distributed
(iid) random variables governed by a common distribution function $F(x)$,
$x \in \mathbb{R}$. We denote the sample mean $\bar X = n^{-1}\sum_{i=1}^n X_i$ and the sample variance
$S^2 = (n-1)^{-1}\sum_{i=1}^n (X_i - \bar X)^2$ where the sample size $n (\ge 2)$ is held fixed.
Now, the two statistics $\bar X$ and $S^2$ would be independently distributed if and
only if we can write

$$P_F\{\bar X \in A \cap S^2 \in B\} = P_F\{\bar X \in A\}\, P_F\{S^2 \in B\} \qquad (1.1)$$

for all sets $A \subseteq \mathbb{R}$, $B \subseteq \mathbb{R}^+$ such that $A \times B$ belongs to the Borel
sigma-field over $\mathbb{R} \times \mathbb{R}^+$.

Now, we present two illustrations successively through data analyses. 

Data Illustration 1.1 In order to examine whether the dependence or
independence between $\bar X$, $S^2$ can be checked out when we had some available
data, we decided to generate random samples, each of size $n = 5$, from
Normal(5,100) population. From each sample, we obtained the values of $\bar x$, $s$
thereby leading to the observed pairs $(\bar x_i, s_i)$, $i = 1, \ldots, 500 (= k$, say). The
respective frequency histograms for $\bar x$ and $u = (n-1)s^2/\sigma^2$ are given in
Figure 1. A joint plot of $\bar x$ and $u$ is given in Figure 2.




Figure 1. Marginal frequency histograms of $\bar x$ and $u (= s^2/25)$ based on 500 observations
from N(5,100) distribution

Figure 2. A plot of $\bar x$ and $u (= s^2/25)$ based on 500 observations from N(5,100) distribution



In Figure 1, the $\bar x$ (or $u$) frequency histogram looks fairly symmetric (or
skewed to the right). From the scatter plot in Figure 2, the $\bar x$ and $u$ values seem
to disperse independently of each other!

For a more formal test of significance, however, we formed a 4 x 3
table (Table 1) of count data based on the observations, indicating how many
from 500 pairs $(\bar x_i, u_i)$ fell in each cell. Then, we simply used the customary
Chi-square test of independence for the cell categories chosen for $\bar x$, $u$.

Table 1. Frequency and Expected Frequency of ($\bar x$, $u$)
In 500 Random Samples

                                      u = s^2/25
$\bar x$          (0,5)              [5,7]             (7,∞)             Total

(-∞,0)            38 (= O_1)         10 (= O_2)        12 (= O_3)        60
  Exp. Freq.      42.24 (= E_1)      9.12 (= E_2)      8.64 (= E_3)

[0,5]             117 (= O_4)        29 (= O_5)        27 (= O_6)        173
  Exp. Freq.      121.79 (= E_4)     26.30 (= E_5)     24.91 (= E_6)

(5,10]            150 (= O_7)        28 (= O_8)        27 (= O_9)        205
  Exp. Freq.      144.32 (= E_7)     31.16 (= E_8)     29.52 (= E_9)

(10,∞)            47 (= O_10)        9 (= O_11)        6 (= O_12)        62
  Exp. Freq.      43.65 (= E_10)     9.42 (= E_11)     8.93 (= E_12)

Total             352                76                72                500


At this point, we like to test the null hypothesis $H_0$: Categories based on
$\bar x$, $u$ are independent against the alternative hypothesis $H_1$: Categories based
on $\bar x$, $u$ are dependent, with the level of significance $\alpha = .05$. Let $O_j$ and $E_j$
respectively denote the observed and expected frequencies (under $H_0$) in the
$j$th cell, $j = 1, \ldots, 12$. Then, the test statistic is given by

$$\chi^2_{calc} = \sum_{j=1}^{rc} \frac{(O_j - E_j)^2}{E_j} \ \text{ with the degree of freedom } \nu = (r-1)(c-1) \qquad (1.2)$$

where $r$, $c$ respectively denote the number of rows and columns.

Now, the test statistic is

$$\begin{aligned}
\chi^2_{calc} ={}& \tfrac{(38-42.24)^2}{42.24} + \tfrac{(10-9.12)^2}{9.12} + \tfrac{(12-8.64)^2}{8.64} + \tfrac{(117-121.79)^2}{121.79} + \tfrac{(29-26.30)^2}{26.30} \\
&+ \tfrac{(27-24.91)^2}{24.91} + \tfrac{(150-144.32)^2}{144.32} + \tfrac{(28-31.16)^2}{31.16} + \tfrac{(27-29.52)^2}{29.52} + \tfrac{(47-43.65)^2}{43.65} \\
&+ \tfrac{(9-9.42)^2}{9.42} + \tfrac{(6-8.93)^2}{8.93} = 4.4545
\end{aligned} \qquad (1.3)$$

with $\nu = (4-1)(3-1) = 6$ degrees of freedom; P-value = 0.61535
$\Rightarrow$ We do not reject the null hypothesis $H_0$ at 5% level.

That is, the observed data does not violate the postulate of independence
between $\bar x$, $s^2$ values at 5% level. Incidentally, the P-value is calculated as
follows:

P-value = P{Observing more extreme data when $H_0$ is true}
= $P\{\chi^2_\nu > 4.4545$, the observed $\chi^2_{calc}$, when $H_0$ is true$\}$ with $\nu = 6$.

We reject (do not reject) $H_0$ with the level of significance $\alpha$ if and only if the
P-value is less (not less) than $\alpha$. A "large" P-value indicates less evidence
against the null hypothesis $H_0$.
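
The calculation for Table 1 can be sketched in a few lines of Python. This is an illustration, not the authors' code: the function names are hypothetical, the expected frequencies are computed without rounding (so the statistic may differ from 4.4545 in the last printed digit), and the upper-tail probability uses a closed form that is valid only for an even number of degrees of freedom, which happens to cover $\nu = 6$.

```python
import math

# Observed counts from Table 1: rows are the xbar categories,
# columns are the u categories (0,5), [5,7], (7,inf).
observed = [
    [38, 10, 12],
    [117, 29, 27],
    [150, 28, 27],
    [47, 9, 6],
]


def chi_square_independence(table):
    """Return (statistic, degrees of freedom, expected counts) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


def chi_square_upper_tail(x, dof):
    """P(chi-square with `dof` degrees of freedom > x), for even `dof` only.

    For dof = 2m the tail equals exp(-x/2) * sum_{j<m} (x/2)^j / j!.
    """
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("the closed form used here needs a positive even dof")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(dof // 2))


statistic, dof, expected = chi_square_independence(observed)
p_value = chi_square_upper_tail(statistic, dof)
print(f"chi-square = {statistic:.4f}, dof = {dof}, P-value = {p_value:.4f}")
```

For tables with an odd number of degrees of freedom one would need a general chi-square tail routine from a statistics library in place of `chi_square_upper_tail`.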

Remark 1.1. Before one applies the Chi-square test (1.2), one needs to make
sure that the expected frequency in each cell, that is each $E_j$, $j = 1, \ldots, rc$, is
five or more. Sometimes this restriction may severely impact the number of
cells that can be chosen.

Remark 1.2. The sample correlation coefficient leading to a t-test is
frequently used in practice to choose between the two hypotheses $H_0$, $H_1$ if
$(\bar X, U)$ could be treated as a bivariate normal random variable. We had the
Pearson-sample correlation coefficient $r_{\bar x,u} = -0.080$ with the P-value =
0.072 which exceeded $\alpha$, indicating that we should not reject the null
hypothesis $H_0$ at 5% level. But, we may not rely upon this test because the underlying
assumption of bivariate normality of $(\bar X, U)$ does not hold here (see Figure 1).
On top of that, the P-value barely exceeded $\alpha$!

Remark 1.3. One may opt for a nonparametric approach to test $H_0$ versus
$H_1$ by using the Spearman-rank correlation coefficient between the $\bar x$, $u$ data.
Refer to Noether (1991, pp. 236-237), Lehmann (1986, pp. 350-351), or
Gibbons and Chakraborti (1992, Chapter 12) for details. What one does first is
to rank all $k$ observations on $\bar x$ and $u$ separately. Then, the Spearman-rank
correlation coefficient between the $\bar x$, $u$ data, denoted by $r^S_{\bar x,u}$, is simply the
Pearson-sample correlation coefficient between the $k$ two-dimensional vectors
of ranks. For the observed data, we found $r^S_{\bar x,u} = -0.091$. Under $H_0$, the
probability distribution of the test statistic $Z = \sqrt{k-1}\, r^S_{\bar x,u}$ is approximated by a
standard normal distribution. One may refer to Noether (1991, pp. 236-237).
We obtain $z_{calc} = -2.0328$, that is the associated P-value $\approx 0.042073$ which
unfortunately falls below the nominal 5% level, indicating that we should
reject the null hypothesis of independence at 5% level. Thus, for the same data,
the t-test and the nonparametric test came up with opposite conclusions!
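
The rank-based procedure of Remark 1.3 can be sketched as follows. The helper names are hypothetical, ties are handled by average ranks (an assumption, since the text does not say how ties were treated), and the P-value is taken to be two-sided, which is consistent with the value 0.042073 quoted above.

```python
import math


def average_ranks(values):
    """Rank the values from 1 to k, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0   # average of positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson sample correlation coefficient of two equal-length sequences."""
    k = len(x)
    mx, my = sum(x) / k, sum(y) / k
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_z_test(r_s, k):
    """Large-sample test: Z = sqrt(k - 1) * r_s is roughly N(0, 1) under H0."""
    z = math.sqrt(k - 1) * r_s
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|z|))
    return z, p_two_sided


# Values reported in Remark 1.3: r_s = -0.091 from k = 500 pairs.
z_calc, p_value = spearman_z_test(-0.091, 500)
print(f"z = {z_calc:.4f}, two-sided P-value = {p_value:.6f}")
```

With raw paired data one would call `spearman(xbar_values, u_values)` first and pass the result to `spearman_z_test`.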

What will be different if the data is generated using a non-normal probability
model? In order to get a feel for this, we provide the following illustration.

Data Illustration 1.2 We generated 500 random samples, each of size
$n = 5$, from a Gamma($\alpha = 4$, $\beta = 8$) model. From each sample, we
obtained the values of $\bar x$, $s$ and the observed vectors $(\bar x_i, s_i)$, $i = 1, \ldots, 500 (= k)$.
The respective frequency histograms for $\bar x$ and $u = (n-1)s^2/(\alpha\beta^2)$ are given
in Figure 3. A joint plot of $\bar x$ and $u$ is given in Figure 4.
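
The sampling design of the two data illustrations can be reproduced along the following lines. This is a sketch under assumptions: the seed and the function names are arbitrary, the generated numbers will not match the authors' data, and $\beta$ is treated as a scale parameter (so that the variance is $\alpha\beta^2$, as in the definition of $u$).

```python
import math
import random
import statistics


def simulate_pairs(draw, n=5, k=500, seed=2024):
    """Draw k samples of size n with `draw(rng)` and return the (xbar, s) pairs."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        sample = [draw(rng) for _ in range(n)]
        # statistics.stdev uses the (n - 1) divisor, matching S^2 in the text.
        pairs.append((statistics.mean(sample), statistics.stdev(sample)))
    return pairs


def pearson(x, y):
    """Pearson sample correlation coefficient of two equal-length sequences."""
    k = len(x)
    mx, my = sum(x) / k, sum(y) / k
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def scaled_variances(pairs, variance, n=5):
    """u = (n - 1) s^2 / variance for each (xbar, s) pair."""
    return [(n - 1) * s ** 2 / variance for _, s in pairs]


# N(5, 100): mean 5 and standard deviation 10.
normal_pairs = simulate_pairs(lambda rng: rng.gauss(5.0, 10.0))
# Gamma(alpha = 4, beta = 8) with beta as a scale parameter: variance 4 * 8^2 = 256.
gamma_pairs = simulate_pairs(lambda rng: rng.gammavariate(4.0, 8.0))

for name, pairs, variance in (("normal", normal_pairs, 100.0), ("gamma", gamma_pairs, 256.0)):
    xbar = [p[0] for p in pairs]
    u = scaled_variances(pairs, variance)
    print(name, "sample correlation of xbar and u:", round(pearson(xbar, u), 3))
```

The printed correlations are only a rough descriptive check; the pairs returned by `simulate_pairs` are what one would tabulate and test formally.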

In Figure 3, both $\bar x$, $u$ frequency histograms look very skewed to the right,
particularly in comparison with Figure 1. From the scatter plot in Figure 4,
the $\bar x$, $u$ values seem to disperse in a dependent fashion. For example, if we
observe a "small" value of $\bar x$, then it seems unlikely that we will also observe a
"large" value of $u$ or equivalently a "large" value of $s$! For a more formal test
of significance, however, we formed a 3 x 3 table (Table 2) of count data in
each cell. Then, we simply used the Chi-square test (1.2).

We may like to test if the categories based on $\bar x$, $s$ are independent at 5% level.
The test statistic from (1.2) is given by

$$\chi^2_{calc} = \sum_{j=1}^{9} \frac{(O_j - E_j)^2}{E_j} = 91.627 \ \text{ with 4 degrees of freedom; P-value} \approx 0$$

$\Rightarrow$ We reject the null hypothesis $H_0$ at 5% level.

We reject independence between $\bar x$, $s$ values at 5% level. In order to claim
that $\bar x$, $s$ values are dependent, note that one simply needs to contradict (1.1)
for some Borel sets A, B. In Table 2, we constructed a precise system of nine
Borel sets for which the multiplicative probability rule quoted in (1.1) does not
hold!
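
To see the failure of the product rule cell by cell, one can compare each joint relative frequency in Table 2 with the product of the corresponding marginal relative frequencies. The sketch below is descriptive only (relative frequencies stand in for the probabilities in (1.1)), and it assumes that the first row of Table 2 reads 75, 119, 9, which is the reading that agrees with the printed row total 203 and column totals 308 and 76.

```python
# Cell counts read from Table 2, rows and columns in the printed order.
# Assumption: the first row is taken as 75, 119, 9 so that it agrees with
# the printed row total 203 and the printed column totals 308 and 76.
counts = [
    [75, 119, 9],
    [36, 162, 38],
    [5, 27, 29],
]

total = sum(sum(row) for row in counts)
row_freq = [sum(row) / total for row in counts]
col_freq = [sum(col) / total for col in zip(*counts)]

# For each cell, compare the joint relative frequency with the product of
# the marginal relative frequencies, as the product rule (1.1) would require.
gaps = []
for i, row in enumerate(counts):
    for j, count in enumerate(row):
        joint = count / total
        product = row_freq[i] * col_freq[j]
        gaps.append((abs(joint - product), i, j))
        print(f"cell ({i + 1},{j + 1}): joint = {joint:.3f}, product = {product:.3f}")

largest_gap, i_max, j_max = max(gaps)
print(f"largest gap {largest_gap:.3f} in cell ({i_max + 1},{j_max + 1})")
```

Whether a given gap is larger than sampling error would allow is exactly what the Chi-square statistic (1.2) assesses.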



Figure 3. Marginal frequency histograms of $\bar x$ and $u (= s^2/64)$ based on 500 observations
from Gamma(4,8) distribution

Figure 4. A plot of $\bar x$ and $u (= s^2/64)$ based on 500 observations from Gamma(4,8)
distribution



Table 2. Frequency and Expected Frequency of ($\bar x$, $s$)
In 500 Random Samples

                                      s
$\bar x$          (0,30)             [30,40]           (40,∞)            Total

(0,10)            75 (= O_1)         119 (= O_2)       9 (= O_3)         203
  Exp. Freq.      47.10 (= E_1)      125.05 (= E_2)    30.86 (= E_3)

[10,20]           36 (= O_4)         162 (= O_5)       38 (= O_6)        236
  Exp. Freq.      54.75 (= E_4)      145.38 (= E_5)    35.87 (= E_6)

[20,∞)            5 (= O_7)          27 (= O_8)        29 (= O_9)        61
  Exp. Freq.      14.15 (= E_7)      37.58 (= E_8)     9.27 (= E_9)

Total             116                308               76                500



We had $r_{\bar x,s} = 0.514$ between the $\bar x$, $s$ data with the P-value $\approx 0$. That
is, the t-test sides with the earlier conclusion to reject the null hypothesis of
independence between the $\bar x$, $s$ data at 5% level. But, also see Remark 1.2.

One may again explore the nonparametric test. We found $r^S_{\bar x,s} = 0.488$,
that is the test statistic $z_{calc} = \sqrt{k-1}\, r^S_{\bar x,s} \approx 10.901$ with the associated
P-value $\approx 0$. That is, we would reject the null hypothesis of independence
between the $\bar x$, $s$ data at 5% level.

In these illustrations, one may want to know which of the two hypotheses
was true. The following result will address this. Let us denote

$$\phi(x) = \frac{1}{\sqrt{2\pi}} \exp(-x^2/2) \ \text{ and } \ \Phi(x) = \int_{-\infty}^{x} \phi(y)\, dy \ \text{ for } x \in \mathbb{R}.$$

THEOREM 1.1 Suppose that $X_1, \ldots, X_n$ are iid random variables governed by
a common distribution function $F(x)$, $x \in \mathbb{R}$. Then, the sample mean $\bar X$ and
the sample variance $S^2$ are independently distributed if and only if the common
distribution of the X's is $N(\mu, \sigma^2)$, that is $F(x) = \Phi((x - \mu)/\sigma)$ for some
$\mu \in \mathbb{R}$, $\sigma \in \mathbb{R}^+$ and for all $x \in \mathbb{R}$.

The if part is a well-known result that may be verified easily with the help of
Helmert transformations (Mukhopadhyay (2000), pp. 197-201). See Section
2 and Remark 2.1. The only if part, however, provides a characterization of
a normal probability law which is quite hard to prove. Zinger's (1958) proof
of the only if part requires deep analyses with an interplay of Cramér's (1946,
pp. 151-165) fundamental results involving characteristic functions. Important
historical notes may be found in Lukacs (1960), Ramachandran (1967, Chapter
8), and Kagan et al. (1973).

In Illustration 1.1, we generated data from a normal probability model, and
hence we would have expected to favor $H_0$ with the help of a "large" P-value.
On the other hand, in Illustration 1.2, we generated data from a gamma
probability model, and hence we would have expected to favor $H_1$ with the help of
a "small" P-value. In the first illustration, both Chi-square and t-tests came
up with the correct answer, but the nonparametric test gave a wrong answer.
In the second illustration, all three tests came up with the correct answer. But,
one needs to keep in mind that in situations like ours, a t-test is not reasonable
anyway! See Remark 1.2.

It is safe to say that among three contenders, the Chi-square test (1.2) thus
far stands out as the most reliable exploratory data analytic (EDA) tool in
validating the true state of nature of dependence (or independence) between
$\bar X$, $S^2$ as evidenced by the observed paired data $\bar x$, $s$ when the observations
$X_1, \ldots, X_n$ are assumed iid. As the story unfolds, one will see that the
Chi-square test would remain most reliable in the same sense when the observations
$X_1, \ldots, X_n$ are not iid or these are non-normal.

17.1.1. What If the Observations Are Not IID or They Are Non-Normal?

In direct contrast with the message from Theorem 1.1, what we find
interesting is that the sample mean $\bar X$ and the variance $S^2$ may or may not follow
independent probability models when the observations $X_i$'s are not iid or when
these follow some non-normal probability laws. We highlight examples
depicting a number of interesting scenarios including the following:

(i) $\bar X$, $S^2$ follow independent probability models, each of the X's follows the same
normal probability law, $\bar X$ has a normal probability model, $S^2$ has a Chi-square
probability model, but the X's are dependent (Section 2);

(ii) $\bar X$, $S^2$ follow independent probability models, the X's follow
non-identical but dependent normal probability laws, $\bar X$ has a normal probability model,
$S^2$ has a (non-central) Chi-square probability model, when $n = 2$ (Section 3);

(iii) $\bar X$, $S^2$ follow dependent and uncorrelated probability models, $\bar X$ has a
non-normal probability model, but $X_1$, $X_2$ both follow standard normal
probability laws and they are dependent when $n = 2$ (Section 4);

(iv) $\bar X$, $S^2$ follow dependent and uncorrelated probability laws, $X_1$ follows
a standard normal probability law, $X_2$ follows a mixture-normal symmetric
probability law, $\bar X$ has a normal probability model, but $X_1$, $X_2$ are dependent
when $n = 2$ (Section 5);

(v) $\bar X$, $S^2$ follow independent probability laws, $\bar X$ has a normal probability
model, $S^2$ does not have a Chi-square probability model, even if the
observations $X_1$, $X_2$ are governed by one common bi-modal mixture-normal
symmetric probability law when $n = 2$ (Section 6);

(vi) $\bar X$, $S^2$ follow independent probability laws, $\bar X$ does not have a normal
probability model, $S^2$ has a Chi-square probability model, even if the
observations $X_1$, $X_2$ are governed by one common bi-modal mixture-normal
symmetric probability law when $n = 2$ (Section 6.1);

(vii) $\bar X$, $S^2$ follow independent probability laws, $\bar X$ has a normal probability
model, $S^2$ does not have a Chi-square probability model, even if the
observations $X_1, \ldots, X_n$ are governed by one common mixture-normal symmetric
probability law when $n > 2$ (Section 7, Example 7.1); and

(viii) $\bar X$, $S^2$ follow independent probability laws, $\bar X$ does not have a normal
probability model, $S^2$ has a Chi-square probability model, even if the
observations $X_1, \ldots, X_n$ are governed by one common mixture-normal symmetric
probability law when $n > 2$ (Section 7, Example 7.2).

Each example, except the one mentioned in Section 2, is new as far as we 
know. The example cited in Section 2 was described in Rao (1973, pp. 196- 
197). In the abstract, we asked the following question: What can one expect
regarding the status of independence or dependence between $\bar X$, $S^2$ when the
random variables X's are allowed to be non-iid or non-normal? The specific
examples described in Sections 2-7 should clearly highlight the point that there 
is a large array of interesting possibilities when the random variables X's are 
allowed to be non-iid or non-normal. 

In order to formulate a general result, in our opinion, one has to focus on
some particular nature of non-iid or non-normal probability model for the
observations $X_1, \ldots, X_n$ and explore necessary and/or sufficient conditions for
the independence between $\bar X$, $S^2$ to hold. The examples here show that one
may expect contrasting results even within scenarios which are "close" to each 
other. In other words, one would necessarily proceed on a case by case basis 
with regard to differing aspects of how non-iid or how non-normal the joint 
probability models are. This article points toward an opening for the develop- 
ment of important characterization results and we hope to see some progress 
on this in the future. 

In the case of Examples (iii)-(v), illustrations through simulated data are
provided in our attempt to examine the performances of the Chi-square test,
t-test, and the nonparametric test in detecting dependence of $x_1$, $x_2$ data as well
as the dependence (or independence) of $\bar x$, $s^2$ data. On a number of occasions,
the t-test and the nonparametric test arrived at conflicting conclusions based
on the same data. We raise the potential of a major problem in implementing
either a t-test or the nonparametric test as EDA tools to examine dependence or
association for paired data in practice! The Chi-square test, however, correctly
validated dependence under consideration in every single case, and this test
never sided against a correct conclusion when the paired variables were in fact
independent. We conclude that in this sense, the Chi-square test (1.2) stands
out as the most reliable EDA tool whether the observations $X_1, \ldots, X_n$ are
assumed iid, or not iid, or non-normal.

17.2 A Multivariate Normal Probability Model 

This interesting situation in the context of a multivariate normal probability
model was described in Rao (1973, pp. 196-197). Consider an $n$-dimensional
vector-valued random variable $\mathbf{X}$ where $\mathbf{X} = (X_1, \ldots, X_n)$. We assume that $\mathbf{X}$
has the $n$-dimensional normal distribution $N_n(\mu\mathbf{1}, \sigma^2\Sigma)$ with
$\mathbf{1}' = (1, \ldots, 1)_{1 \times n}$, $\Sigma_{n \times n} = (1 - \rho)I_{n \times n} + \rho\mathbf{1}\mathbf{1}'$ where $-\infty < \mu < \infty$,
$0 < \sigma < \infty$, $\rho \in (-(n-1)^{-1}, 1) - \{0\}$, and $I$ is the $n \times n$ identity matrix.

We may define the associated Helmert variables (Mukhopadhyay (2000),
pp. 197-201) $Y_1, Y_2, \ldots, Y_n$ where

$$Y_1 = (X_1 + \cdots + X_n)/\sqrt{n},$$

$$Y_i = \{X_1 + \cdots + X_{i-1} - (i-1)X_i\}/\sqrt{i(i-1)}, \quad i = 2, \ldots, n.$$

This constitutes an orthogonal transformation from $(X_1, X_2, \ldots, X_n)$
to $(Y_1, Y_2, \ldots, Y_n)$. One can easily derive the joint probability model for
$Y_1, Y_2, \ldots, Y_n$ from the assumed joint probability model of $X_1, X_2, \ldots, X_n$, and
hence conclude in a straightforward manner that

the random variables $Y_1, Y_2, \ldots, Y_n$ are independent,
$Y_1 \sim N(\sqrt{n}\mu, [1 + (n-1)\rho]\sigma^2)$, and $Y_2, \ldots, Y_n$ are iid $N(0, (1-\rho)\sigma^2)$.

Obviously, we also have

$$(n-1)S^2 = \sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n}X_i^2 - n\bar{X}^2 = \sum_{i=1}^{n}Y_i^2 - Y_1^2 = \sum_{i=2}^{n}Y_i^2, \tag{2.3}$$

since $\sum_{i=1}^{n}X_i^2 = \sum_{i=1}^{n}Y_i^2$. Now, we note that $\bar{X}$ depends only on $Y_1$ whereas
$S^2$ depends only on $(Y_2, \ldots, Y_n)$, but $Y_1$ is independent of $(Y_2, \ldots, Y_n)$. Hence,
$\bar{X}$ and $S^2$ are independently distributed statistics. It is now quite
straightforward to check that $\bar{X}(= Y_1/\sqrt{n})$ is distributed as $N(\mu, n^{-1}[1 + (n-1)\rho]\sigma^2)$ and
$(n-1)S^2\sigma^{-2}$ is distributed as $(1-\rho)\chi^2_{n-1}$. We may summarize the
findings as follows:

$$\bar{X} \text{ is distributed as } N(\mu, n^{-1}[1 + (n-1)\rho]\sigma^2), \quad (n-1)S^2 \text{ is distributed as } (1-\rho)\sigma^2\chi^2_{n-1}, \tag{2.4}$$

and $\bar{X}, S^2$ are independently distributed.

In this example, one readily notices that the if part in Theorem 1.1 does hold,
that is $\bar{X}$ and $S^2$ are independent, when each observation $X_1, \ldots, X_n$ follows
the same $N(\mu, \sigma^2)$ probability law so that they are identically distributed, but
$X_1, \ldots, X_n$ are dependent as $\rho$ is assumed different from zero!
Remark 2.1. A proof of the if part in Theorem 1.1 follows from the derivation
given above if we assume that $\rho = 0$.
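
The orthogonality of the Helmert transformation and the diagonal covariance structure of $Y_1, \ldots, Y_n$ can be checked numerically. The following is a minimal sketch in plain Python; the values `n = 5` and `rho = 0.3` are illustrative assumptions, $\sigma^2$ is taken as 1, and the helper names are hypothetical.

```python
import math

def helmert(n):
    """Rows of the n x n Helmert matrix H, so that Y = H X."""
    rows = [[1.0 / math.sqrt(n)] * n]
    for i in range(2, n + 1):
        scale = math.sqrt(i * (i - 1))
        # Y_i = {X_1 + ... + X_{i-1} - (i-1) X_i} / sqrt(i(i-1))
        rows.append([1.0 / scale] * (i - 1) + [-(i - 1) / scale] + [0.0] * (n - i))
    return rows

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

n, rho = 5, 0.3  # illustrative values, not taken from the text
H = helmert(n)
# Sigma = (1 - rho) I + rho 1 1': ones on the diagonal, rho elsewhere
Sigma = [[1.0 if i == j else rho for j in range(n)] for i in range(n)]

HHt = matmul(H, transpose(H))                    # should be the identity matrix
cov_Y = matmul(matmul(H, Sigma), transpose(H))   # covariance of Y in units of sigma^2
```

Under these assumptions `cov_Y` should be diagonal with leading entry $1 + (n-1)\rho$ and remaining entries $1 - \rho$, which is the structure used to conclude that $Y_1$ is independent of $(Y_2, \ldots, Y_n)$.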

17.3 A Bivariate Normal Probability Model 

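Let us start with a two-dimensional random variable $\mathbf{X}$ where $\mathbf{X} = (X_1, X_2)$
and let $\mathbf{X}$ be governed by the probability model $N_2(\boldsymbol{\mu}, \sigma^2\Sigma)$ where

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} c+1 & c-1 \\ c-1 & c+1 \end{pmatrix}$$

with $-\infty < \mu_1, \mu_2 < \infty$, $\mu_1 \neq \mu_2$, $0 < \sigma < \infty$, and $0 < c < 1$. In
other words, we have $X_1, X_2$ respectively distributed as $N(\mu_1, (c+1)\sigma^2)$ and
$N(\mu_2, (c+1)\sigma^2)$, but they are dependent.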

Now, let us define two random variables $Y_1 = X_1 + X_2$, $Y_2 = X_1 - X_2$ and
denote $\mathbf{Y} = (Y_1, Y_2)$. Observe that any arbitrary linear function $U$ of $Y_1, Y_2$ is
clearly a linear function of $\mathbf{X}$. Now, since $\mathbf{X}$ is distributed as $N_2$, the random
variable $U$ must have a univariate normal distribution. Thus, the random vector
$\mathbf{Y}$ would have a bivariate normal probability model, say $N_2(\boldsymbol{\theta}, \sigma^2\Sigma^*)$ where
$\boldsymbol{\theta}' = (\theta_1, \theta_2)$, $\theta_1 = \mu_1 + \mu_2$, $\theta_2 = \mu_1 - \mu_2$, $\Sigma^* = \mathrm{diag}(4c, 4)$.

Thus, we have $\bar{X} = \frac{1}{2}Y_1 \sim N(\frac{1}{2}(\mu_1 + \mu_2), c\sigma^2)$ and $Y_2 \sim N(\mu_1 - \mu_2, 4\sigma^2)$.
Obviously, $Y_1, Y_2$ have independent probability models since $\Sigma^*$ is a diagonal
matrix so that $\bar{X}$ and $S^2 = \frac{1}{2}(X_1 - X_2)^2 = \frac{1}{2}Y_2^2$ are independently distributed.

This example provides a different scenario from the one described in Section
2 when we fix $n = 2$. Here, we note that marginally $X_1, X_2$ have non-identical
and dependent normal probability models. But, $\bar{X}$ and $S^2$ are independent!

17.4 Bivariate Non-Normal Probability Models: Case I 

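Let us denote the probability density function (pdf) of a bivariate normal
probability model $N_2(\boldsymbol{\theta}, \Sigma)$ with

$$\boldsymbol{\theta} = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$

by $g(w_1, w_2; \theta_1, \theta_2, \sigma_1, \sigma_2, \rho)$ where $-\infty < \theta_1, \theta_2 < \infty$, $0 < \sigma_1, \sigma_2 < \infty$,
$-1 < \rho < 1$, and $-\infty < w_1, w_2 < \infty$. In other words, let us denote

$$g(w_1, w_2; \theta_1, \theta_2, \sigma_1, \sigma_2, \rho) = c\exp\left\{-\tfrac{1}{2}(1 - \rho^2)^{-1}\left[u_1^2 - 2\rho u_1 u_2 + u_2^2\right]\right\} \tag{4.1}$$

where

$$u_1 = (w_1 - \theta_1)/\sigma_1, \quad u_2 = (w_2 - \theta_2)/\sigma_2, \quad c = \left\{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}\right\}^{-1}, \quad -\infty < w_1, w_2 < \infty. \tag{4.2}$$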

Now, we construct an example of a two-dimensional random variable $\mathbf{X}$
with $\mathbf{X} = (X_1, X_2)$ where $X_1, X_2$ both follow the $N(0, 1)$ probability law,
$X_1, X_2$ have dependent probability models, neither $X_1 + X_2$ nor $X_1 - X_2$
follows a normal probability law, but $\bar{X}, S^2$ have dependent probability models.
Here, one finds an example where $X_1, X_2$ have identical but dependent normal
probability laws, and yet $\bar{X}, S^2$ have dependent probability models.

To be specific, we consider the joint probability model for an observation
$\mathbf{X}_{2 \times 1}$ governed by the pdf

$$f(x_1, x_2; \alpha, \rho) = \alpha g(x_1, x_2; 0, 0, 1, 1, \rho) + (1 - \alpha)g(x_1, x_2; 0, 0, 1, 1, -\rho) \tag{4.3}$$

for $-\infty < x_1, x_2 < \infty$, $0 < \alpha, \rho < 1$. The pdf given in (4.3) is a mixture of
two bivariate normal models.

THEOREM 4.1 Suppose that $(X_1, X_2)$ has the joint pdf from (4.3). Let us denote
$Y_1 = X_1 + X_2$ and $Y_2 = X_1 - X_2$. Then, for all $0 < \alpha, \rho < 1$, we have the
following:

(i) Both $X_1, X_2$ have a standard normal probability model, but these are dependent;
(ii) The joint probability model of $Y_1, Y_2$ is governed by the pdf from (4.4), but $\bar{X}$ has
a mixture normal probability model with its pdf from (4.7), and $Y_2$ has an analogous
mixture normal probability model with its pdf from (4.6);
(iii) $Y_1, Y_2$ are dependent, and so are $\bar{X}, S^2$;
(iv) $Y_1, Y_2^q$ are uncorrelated, $q = 1, 2$;
(v) $\bar{X}, S^2$ are uncorrelated.

PROOF (i) From the joint pdf $f(x_1, x_2; \alpha, \rho)$, by integrating $x_1$ or $x_2$ out,
one easily verifies that marginally both $X_1, X_2$ have a standard normal
probability model so that their common pdf is $\phi(x) = \frac{1}{\sqrt{2\pi}}\exp(-x^2/2)$ with
$-\infty < x < \infty$. Now, observe that $f(0, 0; \alpha, \rho) = \frac{1}{2\pi\sqrt{1 - \rho^2}} \neq \phi^2(0) = \frac{1}{2\pi}$,
whatever be $0 < \alpha, \rho < 1$. Thus, the random variables $X_1, X_2$ have identical
but dependent probability models.
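
The argument in part (i) can be illustrated numerically. The sketch below codes $g$ from (4.1)-(4.2) and the mixture (4.3), evaluates the joint pdf at the origin, and approximates the marginal pdf of $X_1$ with a trapezoidal rule. The choice $\alpha = \rho = 0.5$, the integration limits, and the function names are assumptions made only for illustration.

```python
import math

def g(w1, w2, t1, t2, s1, s2, rho):
    """Bivariate normal pdf as written in (4.1)-(4.2)."""
    u1, u2 = (w1 - t1) / s1, (w2 - t2) / s2
    c = 1.0 / (2.0 * math.pi * s1 * s2 * math.sqrt(1.0 - rho ** 2))
    return c * math.exp(-0.5 * (u1 * u1 - 2.0 * rho * u1 * u2 + u2 * u2) / (1.0 - rho ** 2))

def f(x1, x2, alpha, rho):
    """Mixture pdf (4.3)."""
    return alpha * g(x1, x2, 0, 0, 1, 1, rho) + (1 - alpha) * g(x1, x2, 0, 0, 1, 1, -rho)

def phi(x):
    """Standard normal pdf."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def marginal_x1(x1, alpha, rho, lo=-10.0, hi=10.0, steps=4000):
    """Trapezoidal approximation to the integral of f(x1, x2) over x2."""
    h = (hi - lo) / steps
    total = 0.5 * (f(x1, lo, alpha, rho) + f(x1, hi, alpha, rho))
    total += sum(f(x1, lo + k * h, alpha, rho) for k in range(1, steps))
    return total * h

alpha, rho = 0.5, 0.5                      # illustrative values
joint_at_origin = f(0.0, 0.0, alpha, rho)  # equals 1 / (2 pi sqrt(1 - rho^2))
product_at_origin = phi(0.0) ** 2          # equals 1 / (2 pi)
```

With these values `marginal_x1(0.7, alpha, rho)` should agree closely with `phi(0.7)`, while `joint_at_origin` exceeds `product_at_origin`, which is the inequality used to establish dependence.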

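(ii) Next, we consider $Y_1, Y_2$ and then with the help of the one-to-one
transformation from $(x_1, x_2) \to (y_1, y_2)$ we can write down the joint pdf of $Y_1, Y_2$.
Toward this end, we begin with

$$f(x_1, x_2; \alpha, \rho) = \frac{1}{2\pi\sqrt{1 - \rho^2}}\left\{\alpha\exp\left[-\frac{1}{2(1 - \rho^2)}(x_1^2 + x_2^2 - 2\rho x_1 x_2)\right] + (1 - \alpha)\exp\left[-\frac{1}{2(1 - \rho^2)}(x_1^2 + x_2^2 + 2\rho x_1 x_2)\right]\right\},$$

for $-\infty < x_1, x_2 < \infty$. Observe that $x_1 = \frac{1}{2}(y_1 + y_2)$, $x_2 = \frac{1}{2}(y_1 - y_2)$ and
hence the Jacobian matrix amounts to

$$J = \begin{pmatrix} \partial x_1/\partial y_1 & \partial x_1/\partial y_2 \\ \partial x_2/\partial y_1 & \partial x_2/\partial y_2 \end{pmatrix} = \begin{pmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & -\frac{1}{2} \end{pmatrix} \Rightarrow |\det(J)| = \tfrac{1}{2}.$$

Thus, the joint probability model of $Y_1, Y_2$ is governed by the following pdf:

$$h(y_1, y_2; \alpha, \rho) = \frac{1}{2\pi\sqrt{1 - \rho^2}}\left\{\alpha\exp\left[-\frac{1}{4(1 - \rho^2)}(y_1^2 + y_2^2 - \rho y_1^2 + \rho y_2^2)\right] + (1 - \alpha)\exp\left[-\frac{1}{4(1 - \rho^2)}(y_1^2 + y_2^2 + \rho y_1^2 - \rho y_2^2)\right]\right\} \times \tfrac{1}{2},$$

which simplifies to the expression

$$h(y_1, y_2; \alpha, \rho) = \frac{1}{4\pi\sqrt{1 - \rho^2}}\left\{\alpha e^{-\frac{1}{4}y_1^2/(1+\rho)}e^{-\frac{1}{4}y_2^2/(1-\rho)} + (1 - \alpha)e^{-\frac{1}{4}y_1^2/(1-\rho)}e^{-\frac{1}{4}y_2^2/(1+\rho)}\right\} \tag{4.4}$$

for $-\infty < y_1, y_2 < \infty$.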

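Now, by integrating $y_1$ or $y_2$ out from the joint pdf $h(y_1, y_2; \alpha, \rho)$ one easily
verifies that the marginal pdf's of $Y_1, Y_2$ are respectively given by

$$h_1(y_1; \alpha, \rho) = \alpha\frac{1}{2\sqrt{\pi(1+\rho)}}e^{-\frac{1}{4}y_1^2/(1+\rho)} + (1 - \alpha)\frac{1}{2\sqrt{\pi(1-\rho)}}e^{-\frac{1}{4}y_1^2/(1-\rho)} \quad \text{for } -\infty < y_1 < \infty, \tag{4.5}$$

$$h_2(y_2; \alpha, \rho) = \alpha\frac{1}{2\sqrt{\pi(1-\rho)}}e^{-\frac{1}{4}y_2^2/(1-\rho)} + (1 - \alpha)\frac{1}{2\sqrt{\pi(1+\rho)}}e^{-\frac{1}{4}y_2^2/(1+\rho)} \quad \text{for } -\infty < y_2 < \infty. \tag{4.6}$$

Both $h_1(y_1; \alpha, \rho)$, $h_2(y_2; \alpha, \rho)$ happen to be mixtures of $N(0, 2 - 2\rho)$ and
$N(0, 2 + 2\rho)$ distributions. From (4.5), it is obvious that the probability model
for $U = \bar{X}(= \frac{1}{2}Y_1)$ will be governed by the pdf

$$h^*(u; \alpha, \rho) = \alpha\frac{1}{\sqrt{\pi(1+\rho)}}e^{-u^2/(1+\rho)} + (1 - \alpha)\frac{1}{\sqrt{\pi(1-\rho)}}e^{-u^2/(1-\rho)} \quad \text{for } -\infty < u < \infty. \tag{4.7}$$

The pdf $h^*(u; \alpha, \rho)$ happens to be a mixture of $N(0, \frac{1}{2}(1 - \rho))$ and
$N(0, \frac{1}{2}(1 + \rho))$ distributions.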

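(iii) Next, by combining (4.4)-(4.6) we observe that $h(0, 0; \alpha, \rho) = \frac{1}{4\pi\sqrt{1 - \rho^2}}$
whereas

$$h_1(0; \alpha, \rho) = \frac{1}{2\sqrt{\pi}}\left[\frac{\alpha}{\sqrt{1+\rho}} + \frac{1-\alpha}{\sqrt{1-\rho}}\right]$$

and

$$h_2(0; \alpha, \rho) = \frac{1}{2\sqrt{\pi}}\left[\frac{\alpha}{\sqrt{1-\rho}} + \frac{1-\alpha}{\sqrt{1+\rho}}\right].$$

In other words, we would conclude that $h(0, 0; \alpha, \rho) = h_1(0; \alpha, \rho)h_2(0; \alpha, \rho)$
if and only if

$$\frac{1}{\sqrt{1 - \rho^2}} = \left[\frac{\alpha}{\sqrt{1+\rho}} + \frac{1-\alpha}{\sqrt{1-\rho}}\right]\left[\frac{\alpha}{\sqrt{1-\rho}} + \frac{1-\alpha}{\sqrt{1+\rho}}\right]$$

$$\Leftrightarrow \sqrt{1 - \rho^2} = \left[\alpha\sqrt{1-\rho} + (1-\alpha)\sqrt{1+\rho}\right]\left[\alpha\sqrt{1+\rho} + (1-\alpha)\sqrt{1-\rho}\right]$$

$$\Leftrightarrow \sqrt{1 - \rho^2} = \left[\alpha^2 + (1-\alpha)^2\right]\sqrt{1 - \rho^2} + 2\alpha(1-\alpha)$$

$$\Leftrightarrow \rho^2 = 0,$$

since $\alpha^2 + (1 - \alpha)^2 = 1 - 2\alpha(1 - \alpha)$ and $0 < \alpha < 1$. But, we have assumed that $\rho$ is
positive! That is, we have $h(0, 0; \alpha, \rho) \neq h_1(0; \alpha, \rho)h_2(0; \alpha, \rho)$, whatever be
$0 < \alpha < 1$. Hence, whatever be $0 < \alpha < 1$, the random variables $Y_1, Y_2$ are
dependent, that is $\bar{X}, S^2$ have dependent probability models since
$\bar{X} = \frac{1}{2}Y_1$, $S^2 = \frac{1}{2}Y_2^2$.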

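(iv) We obviously have $E[Y_1] = E[Y_2] = 0$, $V[X_1] = V[X_2] = 1$,
$V[Y_1] = 2[1 - \rho(1 - 2\alpha)]$, and $V[Y_2] = 2[1 + \rho(1 - 2\alpha)]$.

Also, note that $Cov(Y_1, Y_2) = V[X_1] - V[X_2] = 1 - 1 = 0$, so that the
Pearson correlation coefficient between the observations $Y_1, Y_2$ is given by
$\rho_{Y_1, Y_2} = Cov(Y_1, Y_2)/\sqrt{V[Y_1]V[Y_2]} = 0$. That is, the observations $Y_1, Y_2$
are uncorrelated whatever be $0 < \alpha, \rho < 1$.

Next, using (4.4), let us evaluate the covariance between the random
variables $Y_1, Y_2^2$ and express

$$Cov(Y_1, Y_2^2) = E[Y_1Y_2^2] - E[Y_1]E[Y_2^2] = E[Y_1Y_2^2],$$

whereas $E[Y_1Y_2^2]$ may be found as follows:

$$\begin{aligned}
E[Y_1Y_2^2] &= \frac{1}{4\pi\sqrt{1 - \rho^2}}\left[\alpha\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y_1y_2^2 e^{-\frac{1}{4}y_1^2/(1+\rho)}e^{-\frac{1}{4}y_2^2/(1-\rho)}\,dy_1\,dy_2 + (1 - \alpha)\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y_1y_2^2 e^{-\frac{1}{4}y_1^2/(1-\rho)}e^{-\frac{1}{4}y_2^2/(1+\rho)}\,dy_1\,dy_2\right] \\
&= \frac{1}{4\pi\sqrt{1 - \rho^2}}\left[\alpha\int_{-\infty}^{\infty} y_2^2 e^{-\frac{1}{4}y_2^2/(1-\rho)}\,dy_2\int_{-\infty}^{\infty} y_1 e^{-\frac{1}{4}y_1^2/(1+\rho)}\,dy_1 + (1 - \alpha)\int_{-\infty}^{\infty} y_2^2 e^{-\frac{1}{4}y_2^2/(1+\rho)}\,dy_2\int_{-\infty}^{\infty} y_1 e^{-\frac{1}{4}y_1^2/(1-\rho)}\,dy_1\right] \\
&= 0.
\end{aligned} \tag{4.8}$$

Also, $V[Y_1]$ and $V[Y_2^2]$ are both finite so that the Pearson correlation coefficient
between the observations $Y_1, Y_2^2$ is given by

$$\rho_{Y_1, Y_2^2} = Cov(Y_1, Y_2^2)/\sqrt{V[Y_1]V[Y_2^2]} = 0.$$

Thus, $Y_1, Y_2^2$ are uncorrelated.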

(v) This follows from part (iv) since $\bar{X} = \frac{1}{2}Y_1$ and $S^2 = \frac{1}{2}Y_2^2$. ■
Remark 4.1 Recall that $E[X_1] = E[X_2] = 0$ and $V[X_1] = V[X_2] = 1$. Also,
we can easily write

$$E[X_1X_2] = \alpha\rho + (1 - \alpha)(-\rho) = (2\alpha - 1)\rho.$$

Thus, we have

$$Cov(X_1, X_2) = E[X_1X_2] = (2\alpha - 1)\rho,$$

so that the Pearson correlation coefficient between the observations $X_1, X_2$
is given by $\rho_{X_1, X_2} = Cov(X_1, X_2)/\sqrt{V[X_1]V[X_2]} = (2\alpha - 1)\rho$. Hence,
whatever be $0 < \rho < 1$, the observations $X_1, X_2$ are uncorrelated if and only
if $\alpha = \frac{1}{2}$.

DATA ILLUSTRATION 4.1 We focus on working under the pdf from (4.3) when
$\alpha = \rho = \frac{1}{2}$ and compare performances of the Chi-square test, t-test, and the
nonparametric test in detecting dependence within $(x_1, x_2)$ data and within
$(y_1, y_2)$ data. Thus, we generated 500 random pairs $(x_{1i}, x_{2i})$, $i = 1, \ldots, 500$
$(= k)$ governed by the joint probability model (4.3) with $\alpha = \rho = \frac{1}{2}$.
Subsequently, we obtained $(y_{1i}, y_{2i})$ where $y_{1i} = x_{1i} + x_{2i}$, $y_{2i} = x_{1i} - x_{2i}$,
$i = 1, \ldots, k$. The frequency histograms for $x_1, x_2$ and $y_1, y_2$ are given in
Figures 5-6. The plots of $x_1$ vs $x_2$ and $y_1$ vs $y_2$ are given in Figure 7.

Figure 5. Marginal frequency histograms of $x_1$ and $x_2$ obtained from observations with the
joint distribution (4.3), $\alpha = \rho = \frac{1}{2}$

Figure 6. Marginal frequency histograms of $y_1 = x_1 + x_2$ and $y_2 = x_1 - x_2$ obtained from
$(x_1, x_2)$ observations with the joint distribution (4.3), $\alpha = \rho = \frac{1}{2}$

From Figures 5-6, we observe that the frequency histograms for $x_1, x_2$ and
$y_1, y_2$ have heavy tails on either side. In Figure 7, the two scatter plots seem to
indicate that both $x_1, x_2$ and $y_1, y_2$ data are dependent as they are expected to
be so.
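
The text does not say how the random pairs were generated. One standard way to draw from a two-component mixture such as (4.3) is sketched below: pick the component with probability $\alpha$, then build a correlated standard normal pair from two independent ones. The function name `sample_mixture`, the seed, and the use of Python's `random` module are assumptions for illustration only.

```python
import math
import random

def sample_mixture(k, alpha, rho, seed=0):
    """Draw k pairs from alpha*N2(0,0,1,1,rho) + (1-alpha)*N2(0,0,1,1,-rho)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        r = rho if rng.random() < alpha else -rho      # choose the mixture component
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1 = z1
        x2 = r * z1 + math.sqrt(1.0 - r * r) * z2      # correlation r within the component
        pairs.append((x1, x2))
    return pairs

xs = sample_mixture(500, alpha=0.5, rho=0.5)
ys = [(x1 + x2, x1 - x2) for x1, x2 in xs]             # (y_1i, y_2i)
xbar_s2 = [(0.5 * y1, 0.5 * y2 * y2) for y1, y2 in ys]  # sample mean and variance for n = 2
```

The last line uses the identities $\bar{x} = \frac{1}{2}y_1$ and $s^2 = \frac{1}{2}y_2^2$ for $n = 2$, so any dependence between the $y_1$ and $y_2$ values carries over to the $(\bar{x}, s^2)$ pairs.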

For a test of significance, however, we formed a 4 x 4 table (Table 3) of
count data of how many pairs $(x_{1i}, x_{2i})$ fell in each cell. Then, we used the
Chi-square test (1.2).

Next, we test whether the categories based on the $x_1, x_2$ data are independent
at 5% level, and the test statistic from (1.2) is given by

$\chi^2_{calc} = \sum_{j=1}^{16}\frac{(O_j - E_j)^2}{E_j} = 34.611$ with 9 degrees of freedom; P-value = 0.0000698
=> We reject the null hypothesis $H_0$ at 5% level.

Since the P-value is "small", we reject the hypothesis of independence
between $x_1, x_2$ values at 5% level.

Figure 7. Plots of $x_1$ vs $x_2$ and $y_1$ vs $y_2$ obtained from $(x_1, x_2)$ observations with the joint
distribution (4.3), $\alpha = \rho = \frac{1}{2}$



Table 3. Frequency and Expected Frequency of (x1, x2)
In 500 Random Samples

                                        x2
x1             (-∞,-1)         [-1,0]          (0,1]           [1,∞)           Total

(-∞,-1)        17 (= O1)       20 (= O2)       14 (= O3)       13 (= O4)       64
  Exp. Freq.   9.60 (= E1)     23.68 (= E2)    22.40 (= E3)    8.32 (= E4)

[-1,0]         27 (= O5)       72 (= O6)       74 (= O7)       14 (= O8)       187
  Exp. Freq.   28.05 (= E5)    69.19 (= E6)    65.45 (= E7)    24.31 (= E8)

(0,1]          21 (= O9)       73 (= O10)      71 (= O11)      20 (= O12)      185
  Exp. Freq.   27.75 (= E9)    68.45 (= E10)   64.75 (= E11)   24.05 (= E12)

[1,∞)          10 (= O13)      20 (= O14)      16 (= O15)      18 (= O16)      64
  Exp. Freq.   9.60 (= E13)    23.68 (= E14)   22.40 (= E15)   8.32 (= E16)

Total          75              185             175             65              500
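
The statistic reported for Table 3 can be reproduced directly from the observed counts. The sketch below assumes that the Chi-square test (1.2) is the usual Pearson statistic $\sum_j (O_j - E_j)^2/E_j$ with expected counts computed from the row and column totals, and it compares the result with the tabulated upper 5% point for 9 degrees of freedom instead of computing a P-value. The function name is hypothetical.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n   # expected count under independence
            stat += (obs - exp) ** 2 / exp
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, df

# Observed counts O_1, ..., O_16 from Table 3 (rows: x1 categories, columns: x2 categories)
table3 = [
    [17, 20, 14, 13],
    [27, 72, 74, 14],
    [21, 73, 71, 20],
    [10, 20, 16, 18],
]
stat, df = chi_square_independence(table3)

CRIT_5PCT_DF9 = 16.919   # upper 5% point of chi-square with 9 df, from standard tables
reject = stat > CRIT_5PCT_DF9
```

The same function applies unchanged to the other frequency tables in this chapter, since only the observed counts are needed as input.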




Similarly, we formed a 4 x 4 table (Table 4) of count data of pairs $(y_{1i}, y_{2i})$
that fell in each cell and proceeded to use the Chi-square test (1.2).

Now, for testing the independence of the categories based on $y_1, y_2$ values at
5% level, the test from (1.2) gives

$\chi^2_{calc} = \sum_{j=1}^{16}\frac{(O_j - E_j)^2}{E_j} = 17.722$ with 9 degrees of freedom; P-value = 0.038539
=> We reject the null hypothesis $H_0$ at 5% level.

We reject independence between $y_1, y_2$ values at 5% level since we observe a
"small" P-value.

With regard to t-tests, we respectively found $r_{x_1, x_2} = 0.124$ and
$r_{y_1, y_2} = -0.045$ with associated P-values = 0.005 and 0.315. That is, the
t-test based on $r_{x_1, x_2}$ will side with the conclusion that $x_1, x_2$ data are
dependent at 5% level, but an analogous t-test based on $r_{y_1, y_2}$ unfortunately gives a
wrong message at 5% level!
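
The reported P-values are consistent with the familiar t-test for a sample correlation coefficient. The sketch below assumes the statistic $t = r\sqrt{k-2}/\sqrt{1-r^2}$ with $k - 2$ degrees of freedom; the exact form of the test is defined outside this section, so this is an assumption. With $k = 500$ the Student t distribution is close to the standard normal, so a normal approximation is used for the two-sided P-value.

```python
import math

def t_statistic(r, k):
    """t statistic for testing zero correlation from a sample correlation r and k pairs."""
    return r * math.sqrt(k - 2) / math.sqrt(1.0 - r * r)

def two_sided_p_normal(z):
    """Two-sided P-value under a standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2.0))

k = 500
t_x = t_statistic(0.124, k)    # (x1, x2) data
t_y = t_statistic(-0.045, k)   # (y1, y2) data
p_x = two_sided_p_normal(t_x)  # approximation to the t-based P-value
p_y = two_sided_p_normal(t_y)
```

Under these assumptions `p_x` falls below 0.05 and `p_y` does not, matching the decisions described above.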



Table 4. Frequency and Expected Frequency of (y1, y2)
In 500 Random Samples

                                        y2
y1             (-∞,-1)         [-1,0]          (0,1]           [1,∞)           Total

(-∞,-1)        14 (= O1)       31 (= O2)       35 (= O3)       25 (= O4)       105
  Exp. Freq.   17.43 (= E1)    33.18 (= E2)    31.71 (= E3)    22.68 (= E4)

[-1,0]         32 (= O5)       40 (= O6)       56 (= O7)       31 (= O8)       159
  Exp. Freq.   26.39 (= E5)    50.24 (= E6)    48.02 (= E7)    34.34 (= E8)

(0,1]          27 (= O9)       56 (= O10)      28 (= O11)      29 (= O12)      140
  Exp. Freq.   23.24 (= E9)    44.24 (= E10)   42.28 (= E11)   30.24 (= E12)

[1,∞)          10 (= O13)      31 (= O14)      32 (= O15)      23 (= O16)      96
  Exp. Freq.   15.94 (= E13)   30.34 (= E14)   28.99 (= E15)   20.74 (= E16)

Total          83              158             151             108             500

Next, with regard to the nonparametric test, we found $r_{x_1, x_2} = 0.090$ and
$r_{y_1, y_2} = -0.058$, along with test statistics $z_{calc} = \sqrt{k - 1}\,r_{x_1, x_2} \approx 2.0104$
and $\sqrt{k - 1}\,r_{y_1, y_2} \approx -1.2956$ respectively. The associated P-values were
0.044389 and 0.19511 respectively, indicating that we would (would not)
reject the hypotheses of independence between $x_1, x_2$ values ($y_1, y_2$ values) at
5% level. That is, the nonparametric test for $y_1, y_2$ data leads to an incorrect
inference in this example!
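
The arithmetic behind the nonparametric test statistics can be reproduced as follows. The sketch takes the reported rank correlation values as given, forms $z_{calc} = \sqrt{k-1}\,r$, and computes a two-sided P-value from the standard normal distribution; treating $z_{calc}$ as approximately standard normal under independence is an assumption here.

```python
import math

def rank_correlation_z(r, k):
    """Large-sample z statistic sqrt(k - 1) * r for a rank correlation r from k pairs."""
    return math.sqrt(k - 1) * r

def two_sided_p(z):
    """Two-sided standard normal P-value."""
    return math.erfc(abs(z) / math.sqrt(2.0))

k = 500
z_x = rank_correlation_z(0.090, k)    # (x1, x2) data
z_y = rank_correlation_z(-0.058, k)   # (y1, y2) data
p_x, p_y = two_sided_p(z_x), two_sided_p(z_y)
```

Small differences from the printed P-values in the later decimal places would be expected if the original computation used unrounded correlation values.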

17.5 Bivariate Non-Normal Probability Models: Case II 

Let us repeat the earlier notation from Section 4. Now, we give an example
of a two-dimensional random variable $\mathbf{X}$ with $\mathbf{X} = (X_1, X_2)$ where $X_1$ is
normally distributed, $X_2$ is not normally distributed, $\bar{X}$ is normally distributed,
$X_1 - X_2$ is not normally distributed, but $\bar{X}$ and $S^2$ are dependent random
variables.

Recall the function $g(w_1, w_2; \theta_1, \theta_2, \sigma_1, \sigma_2, \rho)$ from (4.1). Suppose that
$\mathbf{X}_{2 \times 1}$ has its pdf given by

$$f(x_1, x_2; \alpha) = \alpha g(x_1, x_2; 0, 0, 1, 2, 0.5) + (1 - \alpha)g(x_1, x_2; 0, 0, 1, 3, -0.5) \tag{5.1}$$

for $-\infty < x_1, x_2 < \infty$, $0 < \alpha < 1$.

THEOREM 5.1 Suppose that $(X_1, X_2)$ has the joint pdf from (5.1). Let us denote
$Y_1 = X_1 + X_2$ and $Y_2 = X_1 - X_2$. Then, for all $0 < \alpha < 1$, we have the
following:

(i) $X_1$ has the standard normal probability model, $X_2$ has a mixture normal
probability model governed by the pdf from (5.2), and they are dependent;
(ii) The joint probability model of $Y_1, Y_2$ is governed by the pdf from (5.3), but $\bar{X}$
has $N(0, \frac{7}{4})$ distribution with pdf from (5.4), and $Y_2$ has a mixture normal
probability model with its pdf from (5.5);
(iii) $Y_1, Y_2$ are dependent, and so are $\bar{X}, S^2$;
(iv) $Y_1, Y_2$ are correlated, but $Y_1, Y_2^2$ are uncorrelated;
(v) $\bar{X}, S^2$ are uncorrelated.

PROOF (i) From the joint pdf $f(x_1, x_2; \alpha)$, by integrating $x_1$ or $x_2$ out,
one easily verifies that $X_1$ has the $N(0, 1)$ distribution with its pdf
$\phi(x) = \frac{1}{\sqrt{2\pi}}\exp(-x^2/2)$ for $-\infty < x < \infty$, but the marginal pdf of $X_2$
is given by

$$f_2(x_2; \alpha) = \alpha\frac{1}{2\sqrt{2\pi}}e^{-x_2^2/8} + (1 - \alpha)\frac{1}{3\sqrt{2\pi}}e^{-x_2^2/18} \quad \text{for } -\infty < x_2 < \infty, \tag{5.2}$$

whatever be $0 < \alpha < 1$. It is clear that $f_2(x_2; \alpha)$ happens to be a mixture of
$N(0, 4)$ and $N(0, 9)$ probability models.

Now, observe that $f(0, 0; \alpha) = \frac{1}{6\sqrt{3}\pi}(2 + \alpha)$ whereas $\phi(0)f_2(0; \alpha) = \frac{1}{12\pi}(2 + \alpha)$,
so that we have $f(0, 0; \alpha) \neq \phi(0)f_2(0; \alpha)$. Hence, the random
variables $X_1, X_2$ have dependent probability models.

(ii) We have $Y_1 = X_1 + X_2$, $Y_2 = X_1 - X_2$. Then, along the line of
derivation for Theorem 4.1 part (ii), we can again use transformation techniques to
express the joint pdf of $Y_1, Y_2$ as follows:

$$h(y_1, y_2; \alpha) = \alpha g\left(y_1, y_2; 0, 0, \sqrt{7}, \sqrt{3}, -\frac{3}{\sqrt{21}}\right) + (1 - \alpha)g\left(y_1, y_2; 0, 0, \sqrt{7}, \sqrt{13}, -\frac{8}{\sqrt{91}}\right) \tag{5.3}$$

for $-\infty < y_1, y_2 < \infty$. From (5.3), it is obvious that marginally, $Y_1$ is
distributed as $N(0, 7)$ with its pdf

$$h_1(y_1) = \frac{1}{\sqrt{7}\sqrt{2\pi}}e^{-y_1^2/14} \quad \text{for } -\infty < y_1 < \infty, \tag{5.4}$$

whatever be $0 < \alpha < 1$. However, the marginal pdf of $Y_2$ is given by

$$h_2(y_2; \alpha) = \alpha\frac{1}{\sqrt{6\pi}}e^{-y_2^2/6} + (1 - \alpha)\frac{1}{\sqrt{26\pi}}e^{-y_2^2/26} \quad \text{for } -\infty < y_2 < \infty, \tag{5.5}$$

which happens to be a mixture of $N(0, 3)$ and $N(0, 13)$ probability models.
Obviously, $\bar{X} = \frac{1}{2}Y_1$ is distributed as $N(0, \frac{7}{4})$.

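(iii) Next, by combining (5.3)-(5.5) we observe that
$h(0, 0; \alpha) = \frac{1}{2\pi}\left(\frac{\alpha}{\sqrt{12}} + \frac{1-\alpha}{\sqrt{27}}\right)$ whereas $h_1(0) = \frac{1}{\sqrt{7}\sqrt{2\pi}}$ and
$h_2(0; \alpha) = \frac{1}{\sqrt{2\pi}}\left[\frac{\alpha}{\sqrt{3}} + \frac{1-\alpha}{\sqrt{13}}\right]$.
In other words, we would conclude that $h(0, 0; \alpha) = h_1(0)h_2(0; \alpha)$ if and
only if

$$\frac{1}{2\pi}\left(\frac{\alpha}{\sqrt{12}} + \frac{1-\alpha}{\sqrt{27}}\right) = \frac{1}{\sqrt{7}\sqrt{2\pi}}\,\frac{1}{\sqrt{2\pi}}\left[\frac{\alpha}{\sqrt{3}} + \frac{1-\alpha}{\sqrt{13}}\right]$$

$$\Leftrightarrow \frac{\alpha}{1-\alpha} = \left(\frac{1}{\sqrt{91}} - \frac{1}{\sqrt{27}}\right)\Big/\left(\frac{1}{\sqrt{12}} - \frac{1}{\sqrt{21}}\right),$$

which is a negative number! But, we have assumed that $\alpha \in (0, 1)$ so that we
immediately conclude that $h(0, 0; \alpha) \neq h_1(0)h_2(0; \alpha)$ whatever be $0 < \alpha < 1$.
Hence, for all $0 < \alpha < 1$, the random variables $Y_1, Y_2$ are dependent, that is
$\bar{X}, S^2$ also have dependent probability models since $\bar{X} = \frac{1}{2}Y_1$, $S^2 = \frac{1}{2}Y_2^2$.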

(iv) From (5.4)-(5.5), we obviously have $E[Y_1] = E[Y_2] = 0$, $V[Y_1] = 7$,
and $V[Y_2] = 13 - 10\alpha$. From (5.3), we note that

$$Cov(Y_1, Y_2) = E[Y_1Y_2] = \alpha\left(-\frac{3}{\sqrt{21}}\right)(\sqrt{7})(\sqrt{3}) + (1 - \alpha)\left(-\frac{8}{\sqrt{91}}\right)(\sqrt{7})(\sqrt{13}) = 5\alpha - 8,$$

which is certainly non-zero. Hence, the Pearson correlation coefficient
between the observations $Y_1, Y_2$ is given by

$$\rho_{Y_1, Y_2} = Cov(Y_1, Y_2)/\sqrt{V[Y_1]V[Y_2]} = (5\alpha - 8)/\sqrt{7(13 - 10\alpha)}.$$

Hence, the observations $Y_1, Y_2$ are correlated whatever be $0 < \alpha < 1$.
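
The component parameters in (5.3) and the moments in part (iv) follow from the usual rules for variances and covariances of $X_1 + X_2$ and $X_1 - X_2$. A minimal sketch of that arithmetic is given below; the function names are hypothetical, and the mixture moments are obtained by weighting the component moments, which is valid here because every component has zero means.

```python
def y_covariance(s1, s2, rho):
    """(V[Y1], V[Y2], Cov(Y1, Y2)) for Y1 = X1 + X2, Y2 = X1 - X2 in one normal component."""
    c12 = rho * s1 * s2                       # Cov(X1, X2) within the component
    var_y1 = s1 ** 2 + s2 ** 2 + 2 * c12
    var_y2 = s1 ** 2 + s2 ** 2 - 2 * c12
    cov_y = s1 ** 2 - s2 ** 2                 # V[X1] - V[X2]
    return var_y1, var_y2, cov_y

comp1 = y_covariance(1.0, 2.0, 0.5)    # first component of (5.1)
comp2 = y_covariance(1.0, 3.0, -0.5)   # second component of (5.1)

def mixture_moments(alpha):
    """V[Y1], V[Y2], Cov(Y1, Y2) and the correlation under the mixture (5.3)."""
    var_y1 = alpha * comp1[0] + (1 - alpha) * comp2[0]
    var_y2 = alpha * comp1[1] + (1 - alpha) * comp2[1]
    cov = alpha * comp1[2] + (1 - alpha) * comp2[2]
    return var_y1, var_y2, cov, cov / (var_y1 * var_y2) ** 0.5
```

For example, `mixture_moments(0.5)` gives a correlation of about $-0.735$, in line with the formula $(5\alpha - 8)/\sqrt{7(13 - 10\alpha)}$.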

Next, using (5.3) again, let us evaluate the covariance between the random
variables $Y_1, Y_2^2$ and express

$$Cov(Y_1, Y_2^2) = E[Y_1Y_2^2] - E[Y_1]E[Y_2^2] = E[Y_1Y_2^2],$$

whereas $E[Y_1Y_2^2]$ may be found as follows:

$$\begin{aligned}
E[Y_1E\{Y_2^2 \mid Y_1\}] &= \alpha E\left[Y_1\left\{\tfrac{12}{7} + \tfrac{9}{49}Y_1^2\right\}\right] + (1 - \alpha)E\left[Y_1\left\{\tfrac{27}{7} + \tfrac{64}{49}Y_1^2\right\}\right] \\
&= 0, \text{ since } E[Y_1] = E[Y_1^3] = 0.
\end{aligned} \tag{5.6}$$

Again, both $V[Y_1]$, $V[Y_2^2]$ are finite so that the Pearson correlation coefficient
between the observations $Y_1, Y_2^2$ is given by

$$\rho_{Y_1, Y_2^2} = Cov(Y_1, Y_2^2)/\sqrt{V[Y_1]V[Y_2^2]} = 0.$$

Hence, $Y_1, Y_2^2$ are uncorrelated.

(v) This follows from part (iv) since $\bar{X} = \frac{1}{2}Y_1$ and $S^2 = \frac{1}{2}Y_2^2$. ■
Remark 5.1 We note that $E[X_1] = E[X_2] = 0$, $V[X_1] = 1$,
$V[X_2] = E[X_2^2] = 4\alpha + 9(1 - \alpha) = 9 - 5\alpha$. We can also express

$$E[X_1X_2] = \alpha(1)(2)\tfrac{1}{2} + (1 - \alpha)(1)(3)(-\tfrac{1}{2}) = \tfrac{1}{2}(5\alpha - 3).$$

Thus, we have

$$Cov(X_1, X_2) = E[X_1X_2] = \tfrac{1}{2}(5\alpha - 3),$$

so that $\rho_{X_1, X_2} = \frac{1}{2}(5\alpha - 3)/\sqrt{9 - 5\alpha}$. That is, the observations $X_1, X_2$ are
uncorrelated if and only if $\alpha = \frac{3}{5}$.

DATA ILLUSTRATION 5.1 We focus on working under the pdf from (5.1) when
$\alpha = \frac{1}{2}$ and compare performances of the Chi-square test, t-tests, and the
nonparametric test in detecting dependence for $(x_1, x_2)$ and $(y_1, y_2)$ data. Thus,
we generated random pairs $(x_{1i}, x_{2i})$, $i = 1, \ldots, 500 (= k)$ governed by the
joint probability model (5.1) with $\alpha = \frac{1}{2}$. Subsequently, we obtained $(y_{1i}, y_{2i})$
where $y_{1i} = x_{1i} + x_{2i}$, $y_{2i} = x_{1i} - x_{2i}$, $i = 1, \ldots, k$. The frequency histograms
for $x_1, x_2$ and $y_1, y_2$ are given in Figures 8-9. The plots of $x_1$ vs $x_2$ and $y_1$ vs
$y_2$ are given in Figure 10.

Figure 8. Marginal frequency histograms of $x_1$ and $x_2$ obtained from observations with the
joint distribution (5.1), $\alpha = \frac{1}{2}$

Figure 9. Marginal frequency histograms of $y_1 = x_1 + x_2$ and $y_2 = x_1 - x_2$ obtained from
$(x_1, x_2)$ observations with the joint distribution (5.1), $\alpha = \frac{1}{2}$

Figure 10. Plots of $x_1$ vs $x_2$ and $y_1$ vs $y_2$ obtained from $(x_1, x_2)$ observations with the joint
distribution (5.1), $\alpha = \frac{1}{2}$

From Figure 8, we observe that the frequency histograms for both $x_1, x_2$ are
skewed, whereas from Figure 9, the frequency histograms for both $y_1, y_2$ have
heavy tails on either side. In Figure 10, the two scatter plots seem to indicate
that both $x_1, x_2$ and $y_1, y_2$ are dependent as they are expected to be.

For a more formal test of significance, however, we formed a 5 x 3 table
(Table 5) of count data of how many pairs $(x_{1i}, x_{2i})$ fell in each cell and used
the Chi-square test (1.2).



Table 5. Frequency and Expected Frequency of (x1, x2)
In 500 Random Samples

                                 x1
x2             (-∞,-1)         [-1,0]          (0,∞)           Total

(-∞,-2)        19 (= O1)       27 (= O2)       44 (= O3)       90
  Exp. Freq.   18.36 (= E1)    29.70 (= E2)    41.94 (= E3)

[-2,0]         30 (= O4)       53 (= O5)       85 (= O6)       168
  Exp. Freq.   34.27 (= E4)    55.44 (= E5)    78.29 (= E6)

(0,2]          20 (= O7)       50 (= O8)       62 (= O9)       132
  Exp. Freq.   26.93 (= E7)    43.56 (= E8)    61.51 (= E9)

(2,4]          20 (= O10)      26 (= O11)      37 (= O12)      83
  Exp. Freq.   16.93 (= E10)   27.39 (= E11)   38.68 (= E12)

(4,∞)          13 (= O13)      9 (= O14)       5 (= O15)       27
  Exp. Freq.   5.51 (= E13)    8.91 (= E14)    12.58 (= E15)

Total          102             165             233             500

Next, we test whether the categories based on the $x_1, x_2$ data are independent
at 5% level, and the test statistic from (1.2) is given by

$\chi^2_{calc} = \sum_{j=1}^{15}\frac{(O_j - E_j)^2}{E_j} = 19.782$ with 8 degrees of freedom; P-value = 0.011193
=> We reject the null hypothesis $H_0$ at 5% level.

In this example, we should have expected to see a "small" P-value which we
did. Thus, we reject independence between $x_1, x_2$ values at 5% level.

Table 6. Frequency and Expected Frequency of (y1, y2)
In 500 Random Samples

                                        y2
y1             (-∞,-1)         [-1,0]          (0,1]           (1,∞)           Total

(-∞,-1)        16 (= O1)       18 (= O2)       23 (= O3)       114 (= O4)      171
  Exp. Freq.   60.53 (= E1)    29.75 (= E2)    23.26 (= E3)    57.46 (= E4)

[-1,0]         18 (= O5)       20 (= O6)       24 (= O7)       32 (= O8)       94
  Exp. Freq.   33.28 (= E5)    16.36 (= E6)    12.78 (= E7)    31.58 (= E8)

(0,1]          24 (= O9)       19 (= O10)      10 (= O11)      11 (= O12)      64
  Exp. Freq.   22.66 (= E9)    11.14 (= E10)   8.70 (= E11)    21.50 (= E12)

(1,∞)          119 (= O13)     30 (= O14)      11 (= O15)      11 (= O16)      171
  Exp. Freq.   60.53 (= E13)   29.75 (= E14)   23.26 (= E15)   57.46 (= E16)

Total          177             87              68              168             500



Also, we carry out similar analysis with the $y_1, y_2$ values by forming a 4 x 4
table (Table 6) of count data of how many pairs $(y_{1i}, y_{2i})$ fell in each cell and
used the Chi-square test (1.2) to check whether the categories based on the
$y_1, y_2$ data were independent at 5% level. The test statistic from (1.2) is given
by

$\chi^2_{calc} = \sum_{j=1}^{16}\frac{(O_j - E_j)^2}{E_j} = 222.175$ with 9 degrees of freedom; P-value $\approx 0$
=> We reject the null hypothesis $H_0$ at 5% level.

Here, we may have expected to see a "small" P-value which we do. Thus, we
reject independence between $y_1, y_2$ values at 5% level.

We mention that we found $r_{x_1, x_2} = -0.123$ and $r_{y_1, y_2} = -0.728$ with
the associated P-value = 0.006 and P-value $\approx 0$ respectively. So, the
t-tests based on $r_{x_1, x_2}$ and $r_{y_1, y_2}$ respectively sided with the conclusions that
the $x_1, x_2$ data and $y_1, y_2$ data were dependent at 5% level.

With regard to the nonparametric test, we observed $r_{x_1, x_2} = -0.082$ and
$r_{y_1, y_2} = -0.712$ along with the test statistics $z_{calc} = \sqrt{k - 1}\,r_{x_1, x_2} \approx -1.8317$
and $\sqrt{k - 1}\,r_{y_1, y_2} \approx -15.905$ respectively. The associated P-values were
0.066996 and nearly zero respectively for the two datasets, indicating that the
test would not (would) reject independence between $x_1, x_2$ values ($y_1, y_2$
values) at 5% level. That is, the test using the Spearman-rank correlation
coefficient between the $y_1, y_2$ data led to correct inference, but an analogous test
gave an incorrect decision for the $x_1, x_2$ data.

17.6 A Bivariate Non-Normal Population: Case III 

We repeat the notation from Section 4 and give an example of a
two-dimensional random variable $\mathbf{X}$ with $\mathbf{X} = (X_1, X_2)$ where $X_1, X_2$ are identically
distributed with a common non-normal distribution, $X_1 + X_2$ is normally
distributed, $X_1 - X_2$ is not normally distributed, but $\bar{X}, S^2$ are independent
random variables.

Recall the function $g(w_1, w_2; \theta_1, \theta_2, \sigma_1, \sigma_2, \rho)$ from (4.1). Suppose that $\mathbf{X}$
has its pdf given by

$$f(x_1, x_2; \alpha) = \alpha g(x_1, x_2; -5, 5, 1, 1, .5) + (1 - \alpha)g(x_1, x_2; 5, -5, 1, 1, .5) \tag{6.1}$$

for $-\infty < x_1, x_2 < \infty$, $0 < \alpha < 1$.

THEOREM 6.1 Suppose that $(X_1, X_2)$ has the joint pdf from (6.1). Let us denote
$Y_1 = X_1 + X_2$ and $Y_2 = X_1 - X_2$. Then, for all $0 < \alpha < 1$, we have the
following:

(i) Both $X_1, X_2$ have mixture normal probability models governed by the pdf from
(6.2), and they are dependent;
(ii) The joint probability model of $Y_1, Y_2$ is governed by the pdf from (6.3), but $\bar{X}$ has
$N(0, \frac{3}{4})$ distribution, and $Y_2$ has a mixture normal probability model with its pdf
from (6.5);
(iii) $Y_1, Y_2$ are independent, and so are $\bar{X}, S^2$.

PROOF (i) From the joint pdf $f(x_1, x_2; \alpha)$, by integrating $x_1$ or $x_2$ out, one
easily verifies that $X_1$ and $X_2$ respectively have the marginal pdf's

$$f_1(x_1; \alpha) = \alpha\frac{1}{\sqrt{2\pi}}e^{-(x_1+5)^2/2} + (1 - \alpha)\frac{1}{\sqrt{2\pi}}e^{-(x_1-5)^2/2} \quad \text{for } -\infty < x_1 < \infty,$$

$$f_2(x_2; \alpha) = \alpha\frac{1}{\sqrt{2\pi}}e^{-(x_2-5)^2/2} + (1 - \alpha)\frac{1}{\sqrt{2\pi}}e^{-(x_2+5)^2/2} \quad \text{for } -\infty < x_2 < \infty, \tag{6.2}$$

whatever be $0 < \alpha < 1$. It is clear that both $f_1(x_1; \alpha)$, $f_2(x_2; \alpha)$ happen to be
mixtures of $N(-5, 1)$ and $N(5, 1)$ probability models.

Next, observe that $f(0, 0; \alpha) = \frac{1}{\sqrt{3}\pi}e^{-50}$ whereas $f_1(0; \alpha)f_2(0; \alpha) = \frac{1}{2\pi}e^{-25}$
so that we have $f(0, 0; \alpha) \neq f_1(0; \alpha)f_2(0; \alpha)$. Hence, the random variables
$X_1, X_2$ have dependent probability models.

(ii) We have $Y_1 = X_1 + X_2$, $Y_2 = X_1 - X_2$. Then, along the line of
derivation for Theorem 4.1 part (ii), we can again use transformation techniques to
express the joint pdf of $Y_1, Y_2$ as follows:

$$h(y_1, y_2; \alpha) = \alpha g\left(y_1, y_2; 0, -10, \sqrt{3}, 1, 0\right) + (1 - \alpha)g\left(y_1, y_2; 0, 10, \sqrt{3}, 1, 0\right) \tag{6.3}$$

for $-\infty < y_1, y_2 < \infty$. From (6.3), it is obvious that marginally, $Y_1$ is
distributed as $N(0, 3)$ with its pdf

$$h_1(y_1) = \frac{1}{\sqrt{6\pi}}e^{-y_1^2/6} \quad \text{for } -\infty < y_1 < \infty, \tag{6.4}$$

whatever be $0 < \alpha < 1$. However, the marginal pdf of $Y_2$ is given by

$$h_2(y_2; \alpha) = \alpha\frac{1}{\sqrt{2\pi}}e^{-(y_2+10)^2/2} + (1 - \alpha)\frac{1}{\sqrt{2\pi}}e^{-(y_2-10)^2/2} \quad \text{for } -\infty < y_2 < \infty, \tag{6.5}$$

which happens to be a mixture of $N(-10, 1)$ and $N(10, 1)$ probability models.
Obviously, $\bar{X} = \frac{1}{2}Y_1$ is distributed as $N(0, \frac{3}{4})$.

(iii) From (6.3) it is clear that for all $0 < \alpha < 1$, the random variables $Y_1, Y_2$
are independent, that is $\bar{X}, S^2$ also have independent probability models since
$\bar{X} = \frac{1}{2}Y_1$, $S^2 = \frac{1}{2}Y_2^2$. ■
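
The factorization behind part (iii) can be checked numerically by transforming the pdf (6.1) to the $(y_1, y_2)$ scale with the Jacobian factor $\frac{1}{2}$ and comparing it with the product of the marginals (6.4) and (6.5). The sketch below does this at a few arbitrary points; the evaluation points, the value of `alpha`, and the function names are assumptions for illustration.

```python
import math

def g(w1, w2, t1, t2, s1, s2, rho):
    """Bivariate normal pdf as written in (4.1)-(4.2)."""
    u1, u2 = (w1 - t1) / s1, (w2 - t2) / s2
    c = 1.0 / (2.0 * math.pi * s1 * s2 * math.sqrt(1.0 - rho ** 2))
    return c * math.exp(-0.5 * (u1 * u1 - 2.0 * rho * u1 * u2 + u2 * u2) / (1.0 - rho ** 2))

def f(x1, x2, alpha):
    """Mixture pdf (6.1)."""
    return alpha * g(x1, x2, -5, 5, 1, 1, 0.5) + (1 - alpha) * g(x1, x2, 5, -5, 1, 1, 0.5)

def h_from_f(y1, y2, alpha):
    """Joint pdf of (Y1, Y2) via x1 = (y1 + y2)/2, x2 = (y1 - y2)/2 and |det J| = 1/2."""
    return 0.5 * f(0.5 * (y1 + y2), 0.5 * (y1 - y2), alpha)

def normal_pdf(y, mean, var):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def h1(y1):
    """Marginal pdf (6.4) of Y1."""
    return normal_pdf(y1, 0.0, 3.0)

def h2(y2, alpha):
    """Marginal pdf (6.5) of Y2."""
    return alpha * normal_pdf(y2, -10.0, 1.0) + (1 - alpha) * normal_pdf(y2, 10.0, 1.0)

alpha = 0.5  # illustrative value
points = [(1.0, -9.5), (-2.0, 10.3), (0.3, 0.0)]
factorizes = all(
    math.isclose(h_from_f(y1, y2, alpha), h1(y1) * h2(y2, alpha), rel_tol=1e-9)
    for y1, y2 in points
)
```

Agreement at a handful of points is only a sanity check, not a proof; the proof is the algebraic factorization of (6.3), where both components share the same $N(0, 3)$ factor in $y_1$.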

REMARK 6.1 It is clear that both $X_1, X_2$ have identical mixture normal and
bi-modal probability models governed by the pdf

$$p(x; \alpha) = \frac{1}{2}\frac{1}{\sqrt{2\pi}}e^{-(x+5)^2/2} + \frac{1}{2}\frac{1}{\sqrt{2\pi}}e^{-(x-5)^2/2} \quad \text{for } -\infty < x < \infty, \tag{6.6}$$

when $\alpha = \frac{1}{2}$.

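Remark 6.2 We note that $E[X_1] = 5 - 10\alpha$, $E[X_2] = -5 + 10\alpha$,
$V[X_1] = V[X_2] = 1 + 100\alpha - 100\alpha^2$. We can also express

$$E[X_1X_2] = \alpha\{(1)(1)\tfrac{1}{2} + (-25)\} + (1 - \alpha)\{(1)(1)(\tfrac{1}{2}) + (-25)\} = -\tfrac{49}{2}.$$

Thus, we have

$$Cov(X_1, X_2) = E[X_1X_2] - E[X_1]E[X_2] = -\tfrac{49}{2} - (5 - 10\alpha)(-5 + 10\alpha) = \tfrac{1}{2} - 100\alpha + 100\alpha^2,$$

so that $\rho_{X_1, X_2} = (\frac{1}{2} - 100\alpha + 100\alpha^2)/(1 + 100\alpha - 100\alpha^2)$. That is, the
observations $X_1, X_2$ are uncorrelated if and only if $\alpha = \frac{1}{2} - \frac{7}{10\sqrt{2}} \approx 0.0050253$
or $\frac{1}{2} + \frac{7}{10\sqrt{2}} \approx 0.99497$.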
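
The two values of $\alpha$ in Remark 6.2 are the roots of $100\alpha^2 - 100\alpha + \frac{1}{2} = 0$. The short sketch below solves this quadratic and evaluates the correlation formula; the function name is hypothetical.

```python
import math

def corr_x1_x2(alpha):
    """Correlation between X1 and X2 under (6.1), as given in Remark 6.2."""
    num = 0.5 - 100 * alpha + 100 * alpha ** 2
    den = 1 + 100 * alpha - 100 * alpha ** 2
    return num / den

# Roots of 100 a^2 - 100 a + 1/2 = 0 by the quadratic formula
disc = math.sqrt(100 ** 2 - 4 * 100 * 0.5)
roots = ((100 - disc) / 200, (100 + disc) / 200)

corr_at_half = corr_x1_x2(0.5)   # correlation when alpha = 1/2
```

At $\alpha = \frac{1}{2}$ the formula gives a correlation of about $-0.942$, so the two coordinates are strongly negatively correlated away from the two roots.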

DATA ILLUSTRATION 6.1 We focus on working under the pdf from (6.1) when
$\alpha = \frac{1}{2}$ and compare performances of the Chi-square test, t-tests, and the
nonparametric test in detecting dependence for $(x_1, x_2)$ and $(y_1, y_2)$ data. Thus,
we generated random pairs $(x_{1i}, x_{2i})$, $i = 1, \ldots, 500 (= k)$ governed by the
joint probability model (6.1). Subsequently, we obtained $(y_{1i}, y_{2i})$ where
$y_{1i} = x_{1i} + x_{2i}$, $y_{2i} = x_{1i} - x_{2i}$, $i = 1, \ldots, k$. The frequency histograms
for $x_1, x_2$ and $y_1, y_2$ are given in Figures 11-12. The plots of $x_1$ vs $x_2$ and $y_1$
vs $y_2$ are given in Figure 13.

Figure 11. Marginal frequency histograms of $x_1$ and $x_2$ obtained from observations with the
joint distribution (6.1), $\alpha = \frac{1}{2}$

Figure 12. Marginal frequency histograms of $y_1 = x_1 + x_2$ and $y_2 = x_1 - x_2$ obtained from
$(x_1, x_2)$ observations with the joint distribution (6.1), $\alpha = \frac{1}{2}$

Figure 13. Plots of $x_1$ vs $x_2$ and $y_1$ vs $y_2$ obtained from $(x_1, x_2)$ observations with the joint
distribution (6.1), $\alpha = \frac{1}{2}$



From Figure 11, we observe that the frequency histograms for both $x_1, x_2$
are fairly similar and these are bi-modal. Refer to Remark 6.1. We also note
from Figure 12 that the frequency histogram for $y_1$ resembles the shape of
the $N(0, 3)$ probability model (6.4), and that for $y_2$ resembles the shape of
$h_2(y_2; \alpha)$ given by (6.5) which is clearly bi-modal. In Figure 13, the scatter
plot for $x_1, x_2$ seems to indicate that $x_1, x_2$ data are dependent whereas the
scatter plot for $y_1, y_2$ indicates that $y_1, y_2$ data are independent.

For a test of significance, however, we formed a 2 x 3 table (Table 7) of count
data of how many pairs $(x_{1i}, x_{2i})$ fell in each cell and used the Chi-square test
(1.2).

Table 7. Frequency and Expected Frequency of (x1, x2)
In 500 Random Samples

                                 x2
x1             (-∞,3)          [3,5]           (5,∞)           Total

(-∞,-5)        6 (= O1)        84 (= O2)       38 (= O3)       128
  Exp. Freq.   63.49 (= E1)    31.49 (= E2)    33.02 (= E3)

[-5,∞)         242 (= O4)      39 (= O5)       91 (= O6)       372
  Exp. Freq.   184.51 (= E4)   91.51 (= E5)    95.98 (= E6)

Total          248             123             129             500

For testing whether the categories based on $x_1, x_2$ data are independent at
5% level, the test statistic from (1.2) is given by

$\chi^2_{calc} = \sum_{j=1}^{6}\frac{(O_j - E_j)^2}{E_j} = 188.68$ with 2 degrees of freedom; P-value $\approx 0$
=> We reject the null hypothesis $H_0$ at 5% level.

Since the P-value is "small", we reject independence between $x_1, x_2$ values
at 5% level.

Table 8. Frequency and Expected Frequency of (y1, y2)
In 500 Random Samples

                                        y2
y1             (-∞,-10)        [-10,-5]        (5,10]          (10,∞)          Total

(-∞,-2)        12 (= O1)       18 (= O2)       19 (= O3)       16 (= O4)       65
  Exp. Freq.   16.25 (= E1)    17.29 (= E2)    17.94 (= E3)    13.52 (= E4)

[-2,-1]        23 (= O5)       17 (= O6)       18 (= O7)       12 (= O8)       70
  Exp. Freq.   17.50 (= E5)    18.62 (= E6)    19.32 (= E7)    14.56 (= E8)

(-1,0]         35 (= O9)       32 (= O10)      37 (= O11)      23 (= O12)      127
  Exp. Freq.   31.75 (= E9)    33.78 (= E10)   35.05 (= E11)   26.42 (= E12)

(0,1]          18 (= O13)      26 (= O14)      25 (= O15)      19 (= O16)      88
  Exp. Freq.   22.00 (= E13)   23.41 (= E14)   24.29 (= E15)   18.30 (= E16)

(1,2]          23 (= O17)      20 (= O18)      18 (= O19)      22 (= O20)      83
  Exp. Freq.   20.75 (= E17)   22.08 (= E18)   22.91 (= E19)   17.26 (= E20)

(2,∞)          14 (= O21)      20 (= O22)      21 (= O23)      12 (= O24)      67
  Exp. Freq.   16.75 (= E21)   17.82 (= E22)   18.49 (= E23)   13.94 (= E24)

Total          125             133             138             104             500

Similarly, we formed a 6 x 4 table (Table 8) of count data of pairs $(y_{1i}, y_{2i})$
that fell in each cell and proceeded with the Chi-square test (1.2) for the cells.
For testing whether the categories based on $y_1, y_2$ data are independent at 5%
level, the test statistic from (1.2) is given by

$\chi^2_{calc} = \sum_{j=1}^{24}\frac{(O_j - E_j)^2}{E_j} = 10.223$ with 15 degrees of freedom; P-value = 0.80548
=> We do not reject the null hypothesis $H_0$ at 5% level.

Thus, we do not reject independence between $y_1, y_2$ values at 5% level. In this
example, we should have expected to see a "large" P-value which we did.

With regard to t-tests, we respectively found the Pearson sample correlation
coefficients $r_{x_1,x_2} = -0.941$ and $r_{y_1,y_2} = -0.015$ with associated
P-value $\approx 0$ and P-value = 0.734. That is, the t-test based on $r_{x_1,x_2}$ and
$r_{y_1,y_2}$ respectively sides with the conclusions that the $x_1, x_2$ data are
dependent at 5% level, and that the $y_1, y_2$ data are independent at 5% level.
See Remark 1.2.

Next, with regard to the nonparametric test, we found the Spearman rank
correlation coefficients $r_{x_1,x_2} = -0.624$ and $r_{y_1,y_2} = -0.013$, along with
the test statistics $z_{calc} = \sqrt{k-1}\, r_{x_1,x_2} \approx -13.939$ and
$\sqrt{k-1}\, r_{y_1,y_2} \approx -0.2904$ respectively. The associated P-value
amounts to nearly zero and 0.77151 respectively for the two datasets, indicating
that we would (would not) reject the hypothesis of independence between the
$x_1, x_2$ values ($y_1, y_2$ values) at 5% level. That is, the nonparametric test leads
to correct inferences in this example for both the $x_1, x_2$ and $y_1, y_2$ data. See
Remark 1.3.
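
The two nonparametric test statistics quoted above can be reproduced from the rank correlations and the sample size $k = 500$. The sketch below assumes that the P-value is the two-sided tail of a standard normal distribution, which is consistent with the reported value 0.77151; the function name `rank_correlation_z_test` is illustrative only.

```python
import math


def rank_correlation_z_test(rank_corr, k):
    """Large-sample z statistic sqrt(k - 1) * r and its two-sided normal P-value."""
    z = math.sqrt(k - 1) * rank_corr
    # Two-sided tail: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


z_x, p_x = rank_correlation_z_test(-0.624, 500)   # x_1, x_2 data
z_y, p_y = rank_correlation_z_test(-0.013, 500)   # y_1, y_2 data
print(z_x, p_x)
print(z_y, p_y)
```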

17.6.1. Another Example 

Here, we list a slightly different example. Instead of (6.1), suppose that $\mathbf{X}$ has
its pdf given by

$$f(x_1, x_2; \alpha) = \alpha\, g(x_1, x_2; 5, 5, \sigma, \sigma, .5) + (1 - \alpha)\, g(x_1, x_2; 10, 10, \sigma, \sigma, .5) \qquad (6.7)$$

for $-\infty < x_1, x_2 < \infty$, $0 < \sigma < \infty$, $0 < \alpha < 1$. Now, whatever be
$0 < \sigma < \infty$, $0 < \alpha < 1$, we have the following:

(i) $X_1, X_2$ are identically distributed with a common mixture normal distribution,

(ii) $Y_1 = X_1 + X_2$ is not normally distributed,

(iii) $Y_2 = X_1 - X_2$ is normally distributed with mean zero and variance $\sigma^2$.

Again, we can conclude that

(iv) $\bar{X} (= \frac{1}{2} Y_1)$ and $S^2 (= \frac{1}{2} Y_2^2)$ are independent random variables, and
(v) $2S^2/\sigma^2$ is distributed as Chi-square with one degree of freedom.
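
The variance in (iii) and the Chi-square claim in (v) can be checked in a few lines. The derivation below assumes that $g(x_1, x_2; \mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$ denotes the bivariate normal pdf with means $\mu_1, \mu_2$, standard deviations $\sigma_1, \sigma_2$ and correlation coefficient $\rho$; this parameter order is our reading of (6.7), not something restated in this section.

```latex
% Under either mixture component, X_1 and X_2 share a common mean
% (5 or 10), a common variance \sigma^2 and correlation \rho = .5, so
\begin{align*}
E(Y_2) &= E(X_1) - E(X_2) = 0, \\
\operatorname{Var}(Y_2) &= \sigma^2 + \sigma^2 - 2(.5)\sigma^2 = \sigma^2 .
\end{align*}
% Both components therefore give Y_2 \sim N(0, \sigma^2), and so does the mixture.
% With n = 2 we have S^2 = \tfrac{1}{2}(X_1 - X_2)^2 = \tfrac{1}{2} Y_2^2, hence
\[
\frac{2S^2}{\sigma^2} = \left(\frac{Y_2}{\sigma}\right)^2 \sim \chi^2_1 .
\]
```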

17.7 Multivariate Non-Normal Probability Models 

EXAMPLE 7.1 Let us recall the n-dimensional normal distribution $N_n(\mu\mathbf{1}, \sigma^2\Sigma)$
that was used in Section 2. Suppose that the associated pdf is denoted by
$a_n(\mathbf{x}; \mu, \sigma, \rho)$ for $\mathbf{x} \in \mathbb{R}^n$. Next, let us define an n-dimensional random vector
$\mathbf{X}$ whose pdf is given by

$$f(\mathbf{x}; \alpha) = \alpha\, a_n(\mathbf{x}; 0, \sigma, 0) + (1 - \alpha)\, a_n\big(\mathbf{x}; 0, \tfrac{\sigma}{\sqrt{2}}, \tfrac{1}{n-1}\big) \text{ for } \mathbf{x} \in \mathbb{R}^n, \qquad (7.1)$$

$0 < \sigma < \infty$, $0 < \alpha < 1$ with some fixed $n (> 2)$. Clearly, each $X_1, ..., X_n$ has
a common non-normal distribution which happens to be a mixture of $N(0, \sigma^2)$
and $N(0, \frac{1}{2}\sigma^2)$ probability models.

Now, we may visualize the Helmert variables $Y_1, Y_2, ..., Y_n$ from (2.1) and
pretend applying that orthogonal transformation $\mathbf{x} \to \mathbf{y}$ separately under the
probability models $a_n(\mathbf{x}; 0, \sigma, 0)$ and $a_n(\mathbf{x}; 0, \frac{\sigma}{\sqrt{2}}, \frac{1}{n-1})$ for $\mathbf{x}$. From the
summary results stated in (2.2), under the probability model $a_n(\mathbf{x}; 0, \sigma, 0)$ for
$\mathbf{x}$, we conclude that $Y_1, ..., Y_n$ are iid with the common $N(0, \sigma^2)$ probability
model.

Similarly, under the probability model $a_n(\mathbf{x}; 0, \frac{\sigma}{\sqrt{2}}, \frac{1}{n-1})$ for $\mathbf{x}$, we conclude
that $Y_1 \sim N(0, \sigma^2)$ and $Y_2, ..., Y_n$ are iid with the common $N(0, \frac{n-2}{2(n-1)}\sigma^2)$
distribution.

Hence, whatever be $0 < \sigma < \infty$, $0 < \alpha < 1$, once we implement the
transformation $\mathbf{x} \to \mathbf{y}$ under the probability model $f(\mathbf{x}; \alpha)$ for $\mathbf{x}$ from (7.1),
we can immediately claim that

(i) $Y_1 \sim N(0, \sigma^2)$ so that $\bar{X} (= \frac{1}{\sqrt{n}} Y_1) \sim N(0, \frac{1}{n}\sigma^2)$,

(ii) $Y_2, ..., Y_n$ are iid with the common pdf which is a mixture of $N(0, \sigma^2)$ and
$N(0, \frac{n-2}{2(n-1)}\sigma^2)$ probability models,

(iii) $Y_1$ is distributed independently of the random vector $(Y_2, ..., Y_n)$.

But, referring to (2.3), recall that we can express $S^2 = \frac{1}{n-1}\sum_{i=2}^{n} Y_i^2$, which
clearly implies that

(iv) $\bar{X}, S^2$ have independent probability models.

EXAMPLE 7.2 Here is another example. Along the line of (7.1), suppose that
we have a slightly different n-dimensional random vector $\mathbf{X}$ whose pdf is given
by

$$f(\mathbf{x}; \alpha) = \alpha\, a_n(\mathbf{x}; 0, \sigma, 0) + (1 - \alpha)\, a_n\big(\mathbf{x}; 0, \sqrt{2}\sigma, \tfrac{1}{2}\big) \qquad (7.2)$$

for $\mathbf{x} \in \mathbb{R}^n$, $0 < \sigma < \infty$, $0 < \alpha < 1$. Clearly, each $X_1, ..., X_n$ has a common
non-normal distribution which happens to be a mixture of $N(0, \sigma^2)$ and
$N(0, 2\sigma^2)$ probability models.

Again, we may visualize the Helmert variables $Y_1, Y_2, ..., Y_n$ from (2.1) and
pretend applying that transformation $\mathbf{x} \to \mathbf{y}$ separately under the probability
models $a_n(\mathbf{x}; 0, \sigma, 0)$ and $a_n(\mathbf{x}; 0, \sqrt{2}\sigma, \frac{1}{2})$ for $\mathbf{x}$. From summary results in
(2.2), under the probability model $a_n(\mathbf{x}; 0, \sigma, 0)$ for $\mathbf{x}$, recall that $Y_1, ..., Y_n$ are
iid with the common $N(0, \sigma^2)$ probability model. Also, under the probability
model $a_n(\mathbf{x}; 0, \sqrt{2}\sigma, \frac{1}{2})$ for $\mathbf{x}$, we conclude that $Y_1 \sim N(0, (n+1)\sigma^2)$ and
$Y_2, ..., Y_n$ are iid with the common $N(0, \sigma^2)$ probability model.




Hence, once we implement the transformation $\mathbf{x} \to \mathbf{y}$ under the probability
model $f(\mathbf{x}; \alpha)$ for $\mathbf{x}$ from (7.2), we can immediately claim that

(i) $Y_1$ has the pdf
$$g(y; \alpha) = \alpha \frac{1}{\sigma\sqrt{2\pi}} e^{-y^2/(2\sigma^2)} + (1 - \alpha) \frac{1}{\sigma\sqrt{2\pi(n+1)}} e^{-y^2/(2(n+1)\sigma^2)}$$
for $-\infty < y < \infty$, which is a mixture of $N(0, \sigma^2)$ and $N(0, (n+1)\sigma^2)$
probability models, and
(ii) $Y_2, ..., Y_n$ are iid $N(0, \sigma^2)$.

Then, obviously we also have

(iii) $\bar{X} (= Y_1/\sqrt{n})$ has a mixture normal probability model,
(iv) $(n-1)S^2/\sigma^2$ has the $\chi^2_{n-1}$ distribution,

(v) $Y_1$ is distributed independently of the random vector $(Y_2, ..., Y_n)$ so that $\bar{X}, S^2$
have independent probability models.
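
The variances claimed for the Helmert variables under the second component of (7.2) can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes (2.1) to be the standard Helmert transformation whose first row is $(1/\sqrt{n}, ..., 1/\sqrt{n})$, reads $a_n(\mathbf{x}; \mu, \sigma, \rho)$ as an equicorrelated normal model with common standard deviation $\sigma$ and common correlation $\rho$, and fixes $n = 4$, $\sigma = 1$. The helper names are ours.

```python
import math


def helmert_matrix(n):
    """Standard n x n Helmert matrix (orthogonal, first row constant)."""
    rows = [[1.0 / math.sqrt(n)] * n]
    for i in range(2, n + 1):
        scale = math.sqrt(i * (i - 1))
        rows.append([1.0 / scale] * (i - 1) + [-(i - 1) / scale] + [0.0] * (n - i))
    return rows


def equicorrelated_cov(n, sd, rho):
    """Covariance matrix with common variance sd**2 and common correlation rho."""
    return [[sd * sd * (1.0 if i == j else rho) for j in range(n)] for i in range(n)]


def transformed_cov(h, cov):
    """Covariance of y = h x, that is, h * cov * h^T."""
    n = len(h)
    hc = [[sum(h[i][k] * cov[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    return [[sum(hc[i][k] * h[j][k] for k in range(n)) for j in range(n)] for i in range(n)]


n, sigma = 4, 1.0
h = helmert_matrix(n)
# Second component of (7.2): standard deviation sqrt(2) * sigma, correlation 1/2
cov_y = transformed_cov(h, equicorrelated_cov(n, math.sqrt(2.0) * sigma, 0.5))
print([round(cov_y[i][i], 6) for i in range(n)])
```

Under these assumptions the diagonal comes out as $(n+1)\sigma^2$ for $Y_1$ and $\sigma^2$ for $Y_2, ..., Y_n$, with vanishing off-diagonal entries, in line with the statements above.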

17.8 Concluding Thoughts 

By allowing the observations $X_1, ..., X_n$ to be non-iid or non-normal, we have
provided a number of specific examples where different scenarios developed
with regard to dependence or independence between the sample mean $\bar{X}$ and
the sample variance $S^2$. In these examples, we assigned fixed values for some
of the "parameters" primarily because they made the analyses simpler and
yet they drove the point home. In Sections 4-7, we could clearly envision
population models $f(x_1, x_2)$ (or $f(\mathbf{x})$) defined as mixtures of three or more
appropriate bivariate (or multivariate) normal probability models instead of
focusing only on mixtures of simply two bivariate (or multivariate) normal
probability models time after time. But, we must admit that we have deliberately
stayed away from "generalizing" the examples too much because such
additional frills, in our opinion, will harm both beauty and simplicity of the
message.

Five major illustrations through simulated data have been provided where 
we applied the customary t-test based on Pearson-sample correlation coeffi- 
cient as well as the traditional nonparametric test based on Spearman-rank 
correlation coefficient and the Chi-square test to "validate" independence or 
dependence between the two variables under consideration. We have included 
the t-test because practitioners often rely upon some routine statistical pack- 
ages to come up with Pearson-sample correlation coefficient and the associated 
t-test with the intent to check "dependence" or "association" for paired data.
A succinct summary of our findings follows. 

Data Illustration 1.1: $\bar{X}, S^2$ were independent. The Chi-square and t-tests did
not side against the correct conclusion that the $\bar{x}, s^2$ data were independent.
The nonparametric test came up with a wrong conclusion.

Data Illustration 1.2: $\bar{X}, S^2$ were dependent. The Chi-square test, t-test, and
nonparametric test sided with the correct conclusion that the $\bar{x}, s^2$ data were
dependent.




Data Illustration 4.1: $X_1, X_2$ were dependent with $\rho_{X_1,X_2} = 0$. The Chi-square
and t-tests sided with the correct conclusion that the $x_1, x_2$ data were
dependent. The nonparametric test came up with a wrong conclusion.

$Y_1, Y_2$ were dependent with $\rho_{Y_1,Y_2} = 0$. The Chi-square test sided with
the correct conclusion that the $y_1, y_2$ data were dependent. Both the t-test and
the nonparametric test came up with wrong conclusions.

Data Illustration 5.1: $X_1, X_2$ were dependent with $\rho_{X_1,X_2} = -0.098058$. The
Chi-square and t-tests sided with the correct conclusion that the $x_1, x_2$ data
were dependent. The nonparametric test came up with a wrong conclusion.

$Y_1, Y_2$ were dependent with $\rho_{Y_1,Y_2} = -0.73497$. The Chi-square test,
t-test, and nonparametric test sided with the correct conclusion that the $y_1, y_2$
data were dependent.

Data Illustration 6.1: $X_1, X_2$ were dependent with $\rho_{X_1,X_2} = -0.94231$. The
Chi-square test, t-test, and nonparametric test sided with the correct conclusion
that the $x_1, x_2$ data were dependent.

$Y_1, Y_2$ were independent. The Chi-square test, t-test, and nonparametric test
did not side against the correct conclusion that the $y_1, y_2$ data were independent.

From this summary, it is clear that in some instances the t-test and the
nonparametric test behaved erratically in their "validation" of the independence or
dependence in question. On a number of occasions, the t-test and the nonparametric
test unfortunately arrived at conflicting conclusions based on the same data.
When we had $\rho_{X_1,X_2}$ or $\rho_{Y_1,Y_2}$ significantly away from zero, we noted correct
decisions regardless of which test was used for the $x_1, x_2$ and $y_1, y_2$ data. On
the other hand, whenever we found that $\rho_{X_1,X_2}$ or $\rho_{Y_1,Y_2}$ was zero or nearly
zero, we noted that these tests using the $x_1, x_2$ and $y_1, y_2$ data gave mixed
signals. We realize that if the paired data were independent, then $\rho$ would be
zero, whereas even if the paired data were dependent, again $\rho$ might be zero or
nearly zero. Given this, the present investigation raises the potential of a major
problem in implementing either a t-test or the nonparametric test as EDA tools
to examine dependence or association for paired data in practice!

The Chi-square test, however, correctly validated the dependence under
consideration in every case, and the same test never sided against the correct
conclusion that the paired data were independent when the paired variables were
in fact independent. This exercise suggests that among the three contenders, the
Chi-square test is certainly more reliable.

Abstract The first part of this paper summarizes the essential facts on general optimal
stopping theory for time-homogeneous diffusion processes in $\mathbb{R}^n$. The results
displayed are stated in a little greater generality, but in such a way that they are
neither too restrictive nor too complicated. The second part presents equations
for the value function and the optimal stopping boundary as a free-boundary
(Stefan) problem and further presents the principle of smooth fit. This part is
illustrated by examples where the focus is on optimal stopping problems for the
maximum process associated with a one-dimensional diffusion.



18.1 Introduction 

This paper reviews some methodologies used in optimal stopping problems 
for diffusion processes in $\mathbb{R}^n$. The first aim is to give a quick review of the
general optimal stopping theory by introducing the fundamental concepts of 
excessive and superharmonic functions. The second aim is to introduce the 
common technique to transform the optimal stopping into a free-boundary 
(Stefan) problem, such that explicit or numerical computations of the value 
function and the optimal stopping boundary are possible in specific problems. 

Problems of optimal stopping have a long history in probability theory and
have been widely studied by many authors. Results on optimal stopping were
first developed in the discrete case. The first formulations of optimal stopping
problems for discrete time stochastic processes were in connection with
sequential analysis in mathematical statistics, where the number of observations
is not fixed in advance (that is, it is a random number) but is determined by the
behaviour of the observed data. The results can be found in [Wald, 1947]. [Snell,
1952] obtained the first general results of optimal stopping theory for stochastic
processes in discrete time. For a survey of optimal stopping for Markov
sequences see [Shiryaev, 1978] and the references therein. The first general
results on optimal stopping problems for continuous time Markov processes
were obtained by [Dynkin, 1963] using the fundamental concepts of excessive
and superharmonic functions. There is an abundance of work in general optimal
stopping theory using these concepts, but one of the standard references
is the monograph of [Shiryaev, 1978], where the definitive results of
general optimal stopping theory are stated and which also contains an extensive
list of references to this topic. (Another thorough exposition is found in [Karoui,
1981].) This method gives results on the existence and uniqueness of an optimal
stopping time under very general conditions on the gain function and
the Markov process. Generally, however, the method is very difficult to apply
when solving a specific problem. In a concrete problem with a smooth gain function and
a continuous Markov process, it is a common technique to formulate the optimal
stopping problem as a free-boundary problem for the value function and
the optimal stopping boundary, along with the non-trivial boundary condition
known as the principle of smooth fit (also called smooth pasting ([Shiryaev, 1978]) or
high contact principle ([Øksendal, 1998])). The principle of smooth fit says
that the first derivatives of the value function and the gain function agree at the
optimal stopping boundary (the boundary of the domain of continued observation).
The principle was first applied by [Mikhalevich, 1958] (under the leadership
of Kolmogorov) for concrete problems in sequential analysis and later
independently by [Chernoff, 1961] and [Lindley, 1961]. [McKean, 1965] applied
the principle to the American option problem. Other important papers in this
respect are [Grigelionis & Shiryaev, 1966] and [van Moerbeke, 1974]. For a
complete account of the subject and an extensive bibliography see [Shiryaev,
1978]. [Peskir, 2000] introduced the principle of continuous fit for solving
sequential testing problems for Poisson processes (processes with jumps).

The background for solving concrete optimal stopping problems is the
following. Before and during the seventies, the concrete optimal stopping
problems investigated were for one-dimensional diffusions where the gain process
contained two terms: a function of the time and the process, and a path-dependent
integral of the process (see, among others, [Taylor, 1968], [Shepp, 1969] and
[Davis, 1976]). In the nineties the maximum process (a path-dependent
functional) associated with a one-dimensional diffusion was studied in optimal
stopping. [Jacka, 1991] treated the case of reflected Brownian motion and
later [Dubins et al., 1993] treated the case of Bessel processes. In both papers
the motivation was to obtain sharp maximal inequalities and the problem was
solved by guessing the nature of the optimal stopping boundary. [Graversen
& Peskir, 1998] formulated the maximality principle for the optimal stopping
boundary in the context of geometric Brownian motion. [Peskir, 1998] showed
that the maximality principle is equivalent to the superharmonic characterization
of the value function from the general optimal stopping theory and led to
the solution of the problem for a general diffusion ([Peskir, 1998] also contains
many references to this subject). In recent work, [Graversen, Peskir
& Shiryaev, 2001] formulated and solved an optimal stopping problem where
the gain process was not adapted to the filtration.

Optimal stopping problems appear in many connections and have a wide 
range of applications from theoretical to applied problems. The following ap- 
plications illustrate this point. 

Mathematical finance 

The valuation of American options is based on solving optimal stopping
problems and is prominent in the modern optimal stopping theory. The literature
devoted to pricing American options is extensive; for an account of the
subject see the survey of Myneni [Myneni, 1992] and the references therein.
The most famous result in this direction is that of McKean [McKean, 1965]
solving the standard American option in the Black-Scholes model. This example
can further serve to determine the right time to sell the stocks ([Øksendal,
1998]). In [Shepp & Shiryaev, 1993] the valuation of the Russian option is
computed in the Black-Scholes model (see Example 7). The payoff of the
option is the maximum value of the asset between the purchase time and the
exercise time.

Optimal prediction 

The development of optimal prediction of an anticipated functional of a
continuous time process was recently initiated in [Graversen, Peskir & Shiryaev,
2001] (see Example 8). The general optimal stopping theory cannot be applied
in this case since, due to the anticipated variable, the gain process is
not adapted to the filtration. The problem under consideration in [Graversen,
Peskir & Shiryaev, 2001] is to stop a Brownian path as close as possible to
the unknown ultimate maximum height of the path. The closeness is measured
by a mean-square distance. This problem was extended in [Pedersen, 2003] to
cases where the closeness is measured by an $L^q$ distance and a probability
distance. These problems can be viewed as an optimal decision that needs to be
based on a prediction of the future behaviour of the observable motion. An
example is a trader faced with a decision on anticipated market movements
without knowing the exact date of the optimal occurrence. The argument can
be carried over to other applied problems where such a prediction plays a role.

Sharp inequalities 

Optimal stopping problems are a natural tool to derive sharp versions of
known inequalities, as well as to deduce new sharp inequalities. By this method
Davis [Davis, 1976] derived sharp inequalities for a reflected Brownian motion.
[Jacka, 1991] and [Dubins et al., 1993] derived sharp maximal inequalities
for a reflected Brownian motion and for Bessel processes, respectively. In
the same direction see [Graversen & Peskir, 1997] and [Graversen & Peskir,
1998a] (Doob's inequality for Brownian motion and Hardy-Littlewood inequality,
respectively) and [Pedersen, 2000] (Doob's inequality for Bessel processes).

Mathematical statistics 

The Bayesian approach to sequential analysis of problems on testing two 
statistical hypotheses can be solved by reducing the initial problems to optimal 
stopping problems. Testing two hypotheses about the mean value of a Wiener 
process with drift was solved by [Mikhalevich, 1958] and [Shiryaev, 1969]. 
Peskir & Shiryaev [Peskir, 2000] solved the problem of testing two hypotheses 
on the intensity of a Poisson process. Another problem in this direction is the 
quickest detection problem (disruption problem). Shiryaev [Shiryaev, 1961] 
investigated the problem of detecting (alarm) a change in the mean value of a 
Brownian motion with drift with a minimal error (false alarm). Again, a thor- 
ough exposition of the subject can be found in [Shiryaev, 1978]. 

The remainder of this paper is structured as follows. The next section in- 
troduces the formulation of the optimal stopping problem under consideration. 
The concepts of excessive and superharmonic functions with some basic re- 
sults can be found in Section 18.3. The main theorem on optimal stopping of 
diffusions is the point of discussion in Section 18.4. In Section 18.5, the op- 
timal stopping problem is transformed into a free-boundary problem and the 
principle of smooth fit is introduced. The paper concludes with some exam- 
ples in Section 18.6, where the focus is on optimal stopping problems for the 
maximum process associated with a diffusion. 

18.2 Formulation of the problem 

Let $(X_t)_{t \ge 0}$ be a time-homogeneous diffusion process with state space $\mathbb{R}^n$
associated with the infinitesimal generator

$$L_X = \sum_{i=1}^{n} \mu_i(x) \frac{\partial}{\partial x_i} + \frac{1}{2} \sum_{i,j=1}^{n} (\sigma\sigma^*)_{ij}(x) \frac{\partial^2}{\partial x_i \partial x_j}$$

for $x \in \mathbb{R}^n$ where $\mu : \mathbb{R}^n \to \mathbb{R}^n$ and $\sigma : \mathbb{R}^n \to \mathbb{R}^{n \times m}$ are continuous and
further $\sigma\sigma^*$ is non-negative definite. See [Øksendal, 1998] for conditions on
$\mu(\cdot)$ and $\sigma(\cdot)$ that ensure existence and uniqueness of the diffusion process.
Let $(Z_t)$ be a diffusion process depending on both time and space (and hence
not a time-homogeneous diffusion) given by $(Z_t) = (t, X_t)$, which under $P_z$
starts at $z = (t, x)$. Thus $(Z_t)$ is a diffusion process in $\mathbb{R}_+ \times \mathbb{R}^n$ associated
with the infinitesimal generator

$$L_Z = \frac{\partial}{\partial t} + L_X$$

for $z = (t, x) \in \mathbb{R}_+ \times \mathbb{R}^n$.

The optimal stopping problem to be studied in later sections is of the following
kind. Let $G : \mathbb{R}_+ \times \mathbb{R}^n \to \mathbb{R}$ be a gain function, which will be specified
later. Consider the optimal stopping problem for the diffusion $(Z_t)$ with the
value function given by

$$V_*(z) = \sup_{\tau} E_z\big(G(Z_\tau)\big) \qquad (2.1)$$

where the supremum is taken over all stopping times $\tau$ for $(Z_t)$. At the
elements $\omega \in \Omega$ where $\tau(\omega) = \infty$, set $G(Z_\tau)$ to be $-\infty$. There are
two problems to be solved in connection with the problem (2.1). The first
problem is to compute the value function $V_*$ and the second problem is to
find an optimal stopping time $\tau_*$, that is, a stopping time for $(Z_t)$ such that
$V_*(z) = E_z(G(Z_{\tau_*}))$. Note that optimal stopping times may not exist, or may
not be unique if they do.

18.3 Excessive and superharmonic functions 

This section introduces the two fundamental concepts of excessive and
superharmonic functions, which are the basic concepts in the next section for a
characterization of the value function in (2.1). For the facts presented here and a
complete account (including proofs) of this subject, consult [Shiryaev, 1978].
In the main theorem in the next section it is assumed that the gain function
belongs to the following class of functions. Let $C(Z)$ be the class consisting
of all lower semicontinuous functions $H : \mathbb{R}_+ \times \mathbb{R}^n \to (-\infty, \infty]$ satisfying
either of the following two conditions

$$E_z\big(\sup_{s \ge 0} H(Z_s)\big) < \infty \qquad (3.1)$$

$$E_z\big(\inf_{s \ge 0} H(Z_s)\big) > -\infty \qquad (3.2)$$

for all $z = (t, x)$. If the function $H$ is bounded from below then condition
(3.2) is trivially fulfilled. The following two families of functions are crucial in
the subsequent presentation of the general optimal stopping theory.

DEFINITION 1 (Excessive functions). A function $H \in C(Z)$ is called
excessive for $(Z_t)$ if

$$E_z\big(H(Z_s)\big) \le H(z)$$

for all $s \ge 0$ and all $z = (t, x)$.

DEFINITION 2 (Superharmonic functions). A function $H \in C(Z)$ is
called superharmonic for $(Z_t)$ if

$$E_z\big(H(Z_\tau)\big) \le H(z)$$

for all stopping times $\tau$ for $(Z_t)$ and all $z = (t, x)$.

The basic and useful properties of excessive and superharmonic functions
are stated in [Shiryaev, 1978] and [Øksendal, 1998]. It is clear from the two
definitions that a superharmonic function is excessive. Moreover, in some
cases the converse also holds, which is not obvious. The result is stated
in the next proposition.

PROPOSITION 1 Let $H \in C(Z)$ satisfy condition (3.2). Then $H$ is excessive
for $(Z_t)$ if and only if $H$ is superharmonic for $(Z_t)$.

The above definitions play a definite role in describing the structure of the 
value function in (2.1). The following definition is important in this direction. 

DEFINITION 3 (The least superharmonic (excessive) majorant). Let $G \in C(Z)$
be finite. A superharmonic (excessive) function $H$ is called a superharmonic
(excessive) majorant of $G$ if $H \ge G$. A function $\hat{G}$ is called the
least superharmonic (excessive) majorant of $G$ if

(i) $\hat{G}$ is a superharmonic (excessive) majorant of $G$,

(ii) if $H$ is an arbitrary superharmonic (excessive) majorant of $G$ then
$\hat{G} \le H$.

To complete this section, a general iterative procedure is presented for con- 
structing the least superharmonic majorant under the condition (3.2). 

PROPOSITION 2 Let $G \in C(Z)$ satisfy condition (3.2) and $G < \infty$. Define
the operator

$$Q_j[G](z) = G(z) \vee E_z\big(G(Z_{2^{-j}})\big)$$

and set

$$G_{j,n}(z) = Q_j^n[G](z)$$

where $Q_j^n$ is the $n$'th power of the operator $Q_j$. Then the function

$$\hat{G}(z) = \lim_{j \to \infty} \lim_{n \to \infty} G_{j,n}(z)$$

is the least superharmonic majorant of $G$.
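
To see how the repeated application of such an operator behaves, it may help to look at a finite-state, discrete-time analogue. The sketch below is not the construction of the proposition itself: it replaces the diffusion by a hypothetical symmetric random walk on {0, 1, 2, 3, 4} absorbed at the end points, and the expectation over the time step $2^{-j}$ by one step of the chain. The function name and the example gain are ours. Iterating $H \mapsto H \vee PH$ from $H = G$ then increases to the least excessive majorant of $G$, which for this walk is the smallest concave majorant.

```python
def least_excessive_majorant(gain, transition, tol=1e-12, max_iter=100000):
    """Iterate H -> max(H, P H), starting from H = gain, until the change is below tol."""
    value = list(gain)
    for _ in range(max_iter):
        # One-step expectation (P H)(i) = sum_j P[i][j] * H[j]
        expected = [sum(p * v for p, v in zip(row, value)) for row in transition]
        new_value = [max(v, e) for v, e in zip(value, expected)]
        if max(abs(a - b) for a, b in zip(new_value, value)) < tol:
            return new_value
        value = new_value
    return value


# Symmetric random walk on {0, ..., 4}, absorbed at 0 and 4 (illustrative choice)
transition = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
gain = [0.0, 2.0, 0.0, 0.0, 0.0]
value = least_excessive_majorant(gain, transition)
print(value)
```

In this toy case the limit is linear between the points (1, 2) and (4, 0), and the states where the limit equals the gain form the stopping set.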

There is a simple iterative procedure for the construction of $\hat{G}$ when the
Markov process and the gain function are "nice".

COROLLARY 1 Let $(Z_t)$ be a Feller process and let $G \in C(Z)$ be continuous
and bounded from below. Set

$$G_j(z) = \sup_{t \ge 0} E_z\big(G_{j-1}(Z_t)\big)$$

for $j \ge 1$ and $G_0 = G$. Then

$$\hat{G}(z) = \lim_{j \to \infty} G_j(z)$$

is the least superharmonic majorant of $G$.

REMARK 1 Proposition 2 and Corollary 1 are both valid under condition 
(3.2) and excessive and superharmonic functions are the same in this case, 
according to Proposition 1. When condition (3.2) is violated, the least exces- 
sive majorant may differ from the least superharmonic majorant. In this case, 
the least excessive majorant is smaller than the least superharmonic majorant, 
since there are more excessive functions than superharmonic functions. The 
construction of the least superharmonic majorant follows a similar pattern but 
is generally more complicated (see [Shiryaev, 1978]). 

REMARK 2 The iterative procedures to construct the least superharmonic 
majorant are difficult to apply to concrete problems. This makes it necessary 
to search for explicit or numerical computations of the least superharmonic 
majorant. 

18.4 Characterization of the value function 

The main theorem of general optimal stopping theory of diffusion processes is 
contained in the next theorem. The result gives existence and uniqueness of an 
optimal stopping time in problem (2.1). The result could have been stated in a 
more general setting, but is stated with a minimum of technical assumptions. 
For instance, the theorem also holds for a larger class of Markov processes such
as Lévy processes. For details of this and the main theorem consult [Shiryaev,
1978].

THEOREM 1 Consider the optimal stopping problem (2.1) where the gain
function $G$ is lower semicontinuous and satisfies either (3.1) or (3.2).

(I). The value function $V_*$ is the least superharmonic majorant of the gain
function $G$ with respect to the process $(Z_t)_{t \ge 0}$, that is,

$$V_*(z) = \hat{G}(z)$$

for all $z = (t, x)$.

(II). Define the domain of continued observation

$$C = \{ z \in \mathbb{R}_+ \times \mathbb{R}^n \mid G(z) < V_*(z) \}$$

and let $\tau_*$ be the first exit time of $(Z_t)$ from $C$, that is,

$$\tau_* = \inf\{ t \ge 0 : Z_t \notin C \} .$$

If $\tau_* < \infty$ $P_z$-a.s. for all $z$, then $\tau_*$ is an optimal stopping time for the
problem (2.1), at least when $G$ is continuous and satisfies both (3.1) and (3.2).

(III). If there exists an optimal stopping time $\sigma$ in problem (2.1), then
$\tau_* \le \sigma$ $P_z$-a.s. for all $z$ and $\tau_*$ is also an optimal stopping time for problem
(2.1).

REMARK 3 Part (II) of the theorem gives the existence of an optimal stopping
time. The conditions could have been stated with a little greater generality;
again, for more details cf. [Shiryaev, 1978].

Part (III) of the theorem says that if there exists an optimal stopping time
$\sigma$ then $\tau_*$ is also an optimal stopping time and is the smallest among all
optimal stopping times for problem (2.1). This extremal property of the optimal
stopping time $\tau_*$ characterizes it uniquely.

REMARK 4 Sometimes it is convenient to consider "approximate" optimal
stopping times. An example arises in the setting of Theorem 1 (II) if the
stopping time $\tau_*$ does not satisfy $\tau_* < \infty$ $P_z$-a.s. Then the following
approximate stopping times are available. For $\varepsilon > 0$ let
$C_\varepsilon = \{ z \in \mathbb{R}_+ \times \mathbb{R}^n \mid G(z) < V_*(z) - \varepsilon \}$. Let $\tau_\varepsilon$ be the first exit time of $(Z_t)$ from $C_\varepsilon$,
that is, $\tau_\varepsilon = \inf\{ t \ge 0 : Z_t \notin C_\varepsilon \}$. Then $\tau_\varepsilon < \infty$ $P_z$-a.s. and $\tau_\varepsilon$ is
approximately optimal in the following sense: $\lim_{\varepsilon \downarrow 0} E_z(G(Z_{\tau_\varepsilon})) = V_*(z)$ for
all $z = (t, x)$. Furthermore, $\tau_\varepsilon \uparrow \tau_*$ as $\varepsilon \downarrow 0$.

At first glance, it seems that the initial setting of the optimal stopping
problem (2.1) and Theorem 1 only cover the cases where the gain process is a
function of time and the state of the process $(X_t)$. But the next two examples
illustrate that Theorem 1 also covers some cases where the gain process
contains a path-dependent functional of $(X_t)$, where it is a matter of properly
defining $(Z_t)$.

For simplicity, let $n = 1$ in the examples below and assume, moreover,
that $(X_t)$ solves the stochastic differential equation

$$dX_t = \mu(X_t)\, dt + \sigma(X_t)\, dB_t$$

where $(B_t)$ is a standard Brownian motion.

EXAMPLE 1 (Optimal stopping problems involving an integral). Let
$F : \mathbb{R}_+ \times \mathbb{R} \to \mathbb{R}$ and $c : \mathbb{R} \to \mathbb{R}_+$ be continuous functions. Consider the
optimal stopping problem

$$W_*(t, x) = \sup_{\tau} E_x\Big( F(t + \tau, X_\tau) - \int_0^\tau c(X_u)\, du \Big) . \qquad (4.1)$$

The integral term might be interpreted as an accumulated cost. This problem
can be reformulated to fit in the setting of problem (2.1) and Theorem 1 by the
following simple observations.

Set $A_t = \int_0^t c(X_u)\, du$ and denote $(Z_t)$ by $Z_t = (t, X_t, A_t)$. Thus $(Z_t)$
is a diffusion process in $\mathbb{R}^3$ associated with the infinitesimal generator

$$L_Z = \frac{\partial}{\partial t} + L_X + c(x) \frac{\partial}{\partial a}$$

for $z = (t, x, a)$. Let $G(z) = F(t, x) - a$ be a gain function and consider
the new optimal stopping problem

$$V_*(z) = \sup_{\tau} E_z\big(G(Z_\tau)\big) .$$

This problem fits into the setting of Theorem 1 and it is clear that
$W_*(t, x) = V_*(t, x, 0)$. Note that the gain function $G$ is linear in $a$.

Another approach is to use Itô's formula to reduce the problem (4.1) to the setting
of the initial problem (2.1). Assume that the function $x \mapsto D(x)$ is smooth
and satisfies $L_X D(x) = c(x)$. Itô's formula yields that

$$D(X_t) = D(x) + \int_0^t L_X D(X_u)\, du + M_t$$

where $M_t = \int_0^t D'(X_u)\, \sigma(X_u)\, dB_u$ is a continuous local martingale. The
optional sampling theorem implies that $E_x(M_\tau) = 0$ (by localization and some
uniform integrability conditions) and hence

$$E_x\big(D(X_\tau)\big) = D(x) + E_x\Big( \int_0^\tau c(X_u)\, du \Big) .$$

Therefore, the problem (4.1) is equivalent to solving the initial problem (2.1)
with the gain function $G(t, x) = F(t, x) - D(x)$.
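
As a worked instance of this reduction, consider the simplest special case. The choices below ($\mu = 0$, $\sigma = 1$, so that $X$ is a standard Brownian motion, and a constant cost rate $c \equiv 1$) are our own illustrative assumptions and are not taken from the examples of the paper.

```latex
% Assumed special case: X = B (standard Brownian motion), c(x) = 1.
% The generator is L_X = \tfrac{1}{2}\, d^2/dx^2, and D(x) = x^2 satisfies
\[
L_X D(x) = \tfrac{1}{2} D''(x) = 1 = c(x) .
\]
% The identity E_x(D(X_\tau)) = D(x) + E_x(\int_0^\tau c(X_u)\,du) then reads
\[
E_x\big(B_\tau^2\big) = x^2 + E_x(\tau)
\]
% for stopping times \tau satisfying the integrability conditions mentioned
% above, and problem (4.1) becomes
\[
W_*(t,x) = \sup_{\tau} E_x\big( F(t+\tau, B_\tau) - \tau \big)
         = x^2 + \sup_{\tau} E_x\big( F(t+\tau, B_\tau) - B_\tau^2 \big) ,
\]
% that is, problem (2.1) with gain function G(t,x) = F(t,x) - x^2,
% up to the additive term x^2 that does not depend on \tau.
```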

EXAMPLE 2 (Optimal stopping problems for the maximum process).
Peskir [Peskir, 1998] made the following observation. Denote the maximum
process associated with $(X_t)$ by $S_t = \max_{0 \le u \le t} X_u$. It can be verified
that the two-dimensional process $(Z_t) = (X_t, S_t)$ with state space
$\{ (x, s) \in \mathbb{R}^2 \mid x \le s \}$ (see Figure 1) is a continuous Markov process
associated with the infinitesimal generator

$$L_Z = L_X \quad \text{for } x < s$$

$$\frac{\partial}{\partial s} = 0 \quad \text{for } x = s$$

with $L_X$ given in Section 18.2. Hence the optimal stopping problem

$$V_*(x, s) = \sup_{\tau} E_{x,s}\big(G(X_\tau, S_\tau)\big)$$

for $x \le s$ fits in the setting of Theorem 1.

Figure 1. A simulation of a path of the two-dimensional process
$(B_t(\omega), \max_{0 \le u \le t} B_u(\omega))$ where $(B_t)$ is a Brownian motion.
(Axes: $B_t$ and $\max_{0 \le r \le t} B_r$.)



18.5 The free-boundary problem and the principle of 
smooth fit 

For solving a specific optimal stopping problem the superharmonic
characterization is not easy to apply. To carry out explicit computations of the value
function another methodology is therefore needed. This section considers the
optimal stopping problem as a free-boundary (Stefan) problem. This is also
important for computations of the value function from a numerical point of
view. First, the notion of the characteristic generator (see [Øksendal, 1998])
is introduced; it is an extension of the infinitesimal generator. Let $(Z_t)$ be
the diffusion process given in Section 18.2. For any open set $U \subset \mathbb{R}_+ \times \mathbb{R}^n$,
let $\tau_U = \inf\{ t \ge 0 : Z_t \notin U \}$ be the first exit time from $U$ of
$(Z_t)$.

DEFINITION 4 (Characteristic generator). The characteristic generator
$\mathbb{A}_Z$ of $(Z_t)$ is defined by

\[ \mathbb{A}_Z f(z) = \lim_{U \downarrow z}
   \frac{E_z\big( f(Z_{\tau_U}) \big) - f(z)}{E_z(\tau_U)} \]

where the limit is to be understood in the following sense. The open sets
$U_j$ decrease to the point $z$, that is, $U_{j+1} \subset U_j$ and
$\bigcap_{j \ge 1} U_j = \{z\}$. If $E_z(\tau_U) = \infty$ for all open sets
$U \ni z$, then set $\mathbb{A}_Z f(z) = 0$. Let $\mathcal{D}(\mathbb{A}_Z)$
be the family of Borel functions $f$ for which the limit exists.

REMARK 5 As already mentioned above, the characteristic generator is an
extension of the infinitesimal generator in the sense that
$\mathcal{D}(\mathbb{L}_Z) \subseteq \mathcal{D}(\mathbb{A}_Z)$ and
$\mathbb{L}_Z f = \mathbb{A}_Z f$ for any $f \in \mathcal{D}(\mathbb{L}_Z)$.

Assume in the sequel that the value function $V_*$ in (2.1) is finite. Let
$C = \{ z \in \mathbb{R}_+ \times \mathbb{R}^n \mid V_*(z) > G(z) \}$ be the
domain of continued observation (see Theorem 1). Then the following result
gives equations for the value function in the domain of continued observation.

THEOREM 2 Let the gain function $G$ be continuous and satisfy both conditions
(3.1) and (3.2). Then the value function $V_*(z)$ for $z \in C$ belongs to
$\mathcal{D}(\mathbb{A}_Z)$ and solves the equation

\[ \mathbb{A}_Z V_*(z) = 0 \tag{5.1} \]

for $z \in C$.

REMARK 6 Since the gain function $G$ is continuous and the value function
$V_*$ is lower semicontinuous, the domain of continued observation $C$ is an
open set in $\mathbb{R}_+ \times \mathbb{R}^n$. If $\tau_C < \infty$
$P_z$-a.s. then it follows from Theorem 1 that

\[ V_*(z) = E_z\big( G(Z_{\tau_C}) \big). \]

Then the general Markov process theory yields that the value function solves
the equation (5.1) and Theorem 2 follows directly. In other words, one is led
to formulate equation (5.1).

If the value function is $C^2$ in the domain of continued observation, the
characteristic generator can be replaced by the infinitesimal generator
according to Remark 5. This has the advantage that the infinitesimal generator
is explicitly given.

Equation (5.1) is referred to as a free-boundary problem. The domain of
continued observation $C$ is not known a priori but must be found along with
the unknown value function $V_*$. Usually, a free-boundary problem has many
solutions, and further conditions which the value function $V_*$ satisfies
must be added (e.g. the principle of smooth fit). These additional conditions
are not always enough to determine $V_*$. In that case, one must either guess
or find more sophisticated conditions (e.g. the maximality principle, see
Example 5 in the next section).

The famous principle of smooth fit is one of the most frequently used
non-trivial boundary conditions in optimal stopping. The principle is often
applied in the literature (see, among others, [McKean, 1965], [Jacka, 1991]
and [Dubins et al, 1993]).

The principle of smooth fit

If the gain function $G$ is smooth then a non-trivial boundary condition for
the free-boundary problem might be the following, for $i = 1, \ldots, n$:

\[ \frac{\partial V_*}{\partial t}\Big|_{z \in \partial C}
   = \frac{\partial G}{\partial t}\Big|_{z \in \partial C},
   \qquad
   \frac{\partial V_*}{\partial x_i}\Big|_{z \in \partial C}
   = \frac{\partial G}{\partial x_i}\Big|_{z \in \partial C}. \]

A result in [Shiryaev, 1978] states that the principle of smooth fit holds
under fairly general assumptions. The principle of smooth fit is a very fine
condition in the sense that the value function is often precisely $C^1$ at
the boundary of the domain of continued observation. This is demonstrated in
the examples in the next section.

The above results can be used to formulate the following method for solving 
a particular stopping problem. 

A recipe to solve optimal stopping problems 

Step 1. First one tries to guess the nature of the optimal stopping boundary and 
then, by using ad hoc arguments, to formulate a free-boundary prob- 
lem with the infinitesimal generator and some boundary conditions. The 
boundary conditions can be trivial ones (e.g. the value function is contin- 
uous, odd/even, normal reflection etc.) or non-trivial, such as the princi- 
ple of smooth fit and the maximality principle. 

Step 2. One solves the formulated free-boundary system and maximizes over 
the family of solutions if there is no unique solution. 

Step 3. Finally, one must verify that the guessed candidates for the value
function and the optimal stopping time are indeed correct (e.g., using Itô's
formula).

The methodology has been used in, among others, [Dubins et al, 1993], 
[Graversen & Peskir, 1998], [Pedersen, 2000] and [Shepp & Shiryaev, 1993]. 

It is generally difficult to find the appropriate solution of the (partial)
differential equation $\mathbb{L}_Z V(z) = 0$. It is therefore of most
interest to formulate the free-boundary problem such that the dimension of the
problem is as small as possible. The two examples below present cases where
the dimension can be reduced. For simplicity let $n = 1$ and assume, moreover,
that $(X_t)$ solves the stochastic differential equation

\[ dX_t = \mu(X_t)\,dt + \sigma(X_t)\,dB_t \]

where $(B_t)$ is a standard Brownian motion.

EXAMPLE 3 (Integral and discounted problem). The general Markov process
theory states that the free-boundary problem is one-dimensional in some
special cases.

1. Let $F : \mathbb{R} \to \mathbb{R}$ and $c : \mathbb{R} \to \mathbb{R}_+$
be continuous functions and let the gain function be given by
$G(x,a) = F(x) - a$, which is linear in $a$ (see Example 1). Let
$(Z_t) = (X_t, A_t)$ where $A_t = \int_0^t c(X_u)\,du$ and consider the
two-dimensional optimal stopping problem

\[ V_*(x) = \sup_{\tau} E_x\Big( F(X_\tau) - \int_0^\tau c(X_u)\,du \Big). \]

At first glance, it seems to be a two-dimensional problem, but the Markov
process theory yields that the free-boundary problem can be formulated as

\[ \mathbb{L}_X V_*(x) = c(x) \]

for $x$ in the domain of continued observation, which is also clear from the
last part of Example 1. This is a one-dimensional problem.

2. Let the gain function be given by $G(t,x) = e^{-\lambda t} F(x)$ where
$\lambda > 0$ is a constant. Let $(Z_t) = (t, X_t)$ and consider the
"two-dimensional" optimal stopping problem

\[ V_*(x) = \sup_{\tau} E_x\big( e^{-\lambda \tau} F(X_\tau) \big). \]

In this case, the free-boundary problem can be formulated as

\[ \mathbb{L}_X V_*(x) = \lambda V_*(x) \]

for $x$ in the domain of continued observation. Again, this is a
one-dimensional problem.

EXAMPLE 4 (Deterministic time-change method). This example uses a
deterministic time-change to reduce the problem. The method is described in
[Pedersen & Peskir, 2000]. Consider the optimal stopping problem

\[ V_*(t,x) = \sup_{\tau} E_x\big( \alpha(t + \tau)\, X_\tau \big) \]

where $\alpha$ is a smooth non-linear function. Thus, the value function $V_*$
might solve the following partial differential equation

\[ \frac{\partial V_*}{\partial t}(t,x) + \mathbb{L}_X V_*(t,x) = 0 \]

for $(t,x)$ in the domain of continued observation.

The time-change method transforms the original problem into a new optimal
stopping problem, such that the new value function solves an ordinary
differential equation. The problem is to find a deterministic time-change
$t \mapsto \sigma_t$ which satisfies the following two conditions:

(i) $t \mapsto \sigma_t$ is continuous and strictly increasing,

(ii) There exists a one-dimensional time-homogeneous diffusion $(Y_t)$ with
infinitesimal generator $\mathbb{L}_Y$ such that
$\alpha(\sigma_t)\, X_{\sigma_t} = e^{-\lambda t}\, Y_t$ for some
$\lambda \in \mathbb{R}$.

Condition (i) ensures that $\tau$ is a stopping time for $(Y_t)$ if and only
if $\sigma_\tau$ is a stopping time for $(X_t)$. Substituting (ii) in the
problem, the new (time-changed) value function becomes

\[ W_*(y) = \sup_{\tau} E_y\big( e^{-\lambda \tau}\, Y_\tau \big). \]

As in Example 3 the new value function might solve the ordinary differential
equation

\[ \mathbb{L}_Y W_*(y) = \lambda W_*(y) \]

in the domain of continued observation. Given the diffusion $(X_t)$ the
crucial point is to find the process $(Y_t)$ and the time-change $\sigma_t$
fulfilling the two conditions above. By Itô calculus it can be shown that the
time-change given by

\[ \sigma_t = \inf\Big\{ r > 0 \,\Big|\, \int_0^r \rho(u)\,du > t \Big\} \]

where $\rho(\cdot)$ satisfies that the two terms

\[ \Big( \frac{\alpha'(t)}{\alpha(t)}\, y
        + \alpha(t)\, \mu\big( y/\alpha(t) \big) \Big) \frac{1}{\rho(t)}
   \quad \text{and} \quad
   \alpha(t)^2\, \sigma^2\big( y/\alpha(t) \big)\, \frac{1}{\rho(t)} \]

do not depend on $t$, will fulfill the above two conditions. This clearly
imposes the following conditions on $\alpha(\cdot)$ to make the method
applicable

\[ \mu\big( y/\alpha(t) \big) = \gamma(t)\, G_1(y)
   \quad \text{and} \quad
   \sigma^2\big( y/\alpha(t) \big) = \frac{\gamma(t)}{\alpha(t)}\, G_2(y) \]

where $\gamma(t)$, $G_1(y)$ and $G_2(y)$ are functions required to exist. For
more information and remaining details of this method see
[Pedersen & Peskir, 2000] (see also [Graversen, Peskir & Shiryaev, 2001]).
18.6 Examples and applications

This section presents the solutions of three examples of stopping problems
which illustrate the method established in the previous section and some
applications. The focus will be on optimal stopping problems for the maximum
process associated with a one-dimensional diffusion.

Let $n = 1$. Assume that $(X_t)$ is a non-singular diffusion with state space
$\mathbb{R}$, that is $x \mapsto \sigma(x) > 0$, and $(X_t)$ solves the
stochastic differential equation

\[ dX_t = \mu(X_t)\,dt + \sigma(X_t)\,dB_t \]

where $(B_t)$ is a standard Brownian motion. The infinitesimal generator of
$(X_t)$ is given by

\[ \mathbb{L}_X = \mu(x)\,\frac{\partial}{\partial x}
   + \frac{1}{2}\,\sigma^2(x)\,\frac{\partial^2}{\partial x^2}. \tag{6.1} \]

Let $S_t = \big( \max_{0 \le u \le t} X_u \big) \vee s$ denote the maximum
process associated with $(X_t)$ and let it start at $s \ge x$ under
$P_{x,s}$. The scale function and speed measure of $(X_t)$ are given by

\[ S(x) = \int_0^x \exp\Big( -2 \int_0^u \frac{\mu(r)}{\sigma^2(r)}\,dr \Big)\,du
   \quad \text{and} \quad
   m(dx) = \frac{2}{S'(x)\,\sigma^2(x)}\,dx \]

for $x \in \mathbb{R}$.
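For a given drift and diffusion coefficient the scale function is a double integral that can be approximated numerically. The sketch below uses the trapezoidal rule for both integrals; the helper name `scale_function` and the grid size are illustrative assumptions, and the coefficients are passed in as ordinary Python callables.

```python
import math


def scale_function(mu, sigma, x, n=2000):
    """Approximate S(x) = int_0^x exp(-2 int_0^u mu(r)/sigma(r)^2 dr) du (trapezoidal rule)."""
    h = x / n                               # h is negative when x < 0
    inner = 0.0                             # running value of int_0^u mu/sigma^2 dr
    prev_ratio = mu(0.0) / sigma(0.0) ** 2
    prev_integrand = 1.0                    # exp(0) at u = 0
    total = 0.0
    for i in range(1, n + 1):
        u = i * h
        ratio = mu(u) / sigma(u) ** 2
        inner += 0.5 * (prev_ratio + ratio) * h
        integrand = math.exp(-2.0 * inner)
        total += 0.5 * (prev_integrand + integrand) * h
        prev_ratio, prev_integrand = ratio, integrand
    return total


# Brownian motion with unit drift: S(x) = (1 - exp(-2x)) / 2
s_drift = scale_function(lambda r: 1.0, lambda r: 1.0, 1.0)
```

For a driftless Brownian motion the routine returns $S(x) = x$, so the process is already in natural scale.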

The first example is important from the general optimal stopping theory 
point of view. 

EXAMPLE 5 (The maximality principle). The results of this example are found
in [Peskir, 1998]. Let $x \mapsto c(x) > 0$ be a continuous (cost) function.
Consider the optimal stopping problem with the value function

\[ V_*(x,s) = \sup_{\tau} E_{x,s}\Big( S_\tau - \int_0^\tau c(X_u)\,du \Big) \tag{6.2} \]

where the supremum is taken over all stopping times $\tau$ for $(X_t)$
satisfying

\[ E_{x,s}\Big( \int_0^\tau c(X_u)\,du \Big) < \infty \tag{6.3} \]

for all $x \le s$. The recipe from the previous section is applied to solve
the problem.

1. The process $(X_t, S_t)$ with state space
$\{ (x,s) \in \mathbb{R}^2 \mid x \le s \}$ changes only in the second
coordinate when it hits the diagonal $x = s$ in $\mathbb{R}^2$ (see Figure 1).
It can be shown that it is not optimal to stop on the diagonal. Due to the
positive cost function $c(\cdot)$ the optimal stopping boundary might be a
function which stays below the diagonal. Thus, the stopping time might be of
the form $\tau_* = \inf\{ t > 0 : X_t \le g_*(S_t) \}$ for some function
$s \mapsto g_*(s) < s$ to be found. In other words, the domain of continued
observation is of the form
$C = \{ (x,s) \in \mathbb{R}^2 \mid g_*(s) < x \le s \}$. It is now natural to
formulate the following free-boundary problem, of which the value function and
the optimal stopping boundary are a solution:

\[ \mathbb{L}_X V(x,s) = c(x) \quad \text{for } g(s) < x < s \text{ with } s \text{ fixed} \tag{6.4} \]

\[ \frac{\partial V}{\partial s}(x,s)\Big|_{x = s-} = 0 \quad \text{(normal reflection)} \tag{6.5} \]

\[ V(x,s)\Big|_{x = g(s)+} = s \quad \text{(instantaneous stopping)} \tag{6.6} \]

\[ \frac{\partial V}{\partial x}(x,s)\Big|_{x = g(s)+} = 0 \quad \text{(smooth fit)}. \tag{6.7} \]

Note that (6.4) and (6.5) follow from Example 2 and Example 3. The condition
(6.6) is clear and, since the setting is smooth, the principle of smooth fit
should be satisfied, that is, condition (6.7) holds. (The theorem below shows
that the guessed system is indeed correct.)
2. Define the function

\[ V_g(x,s) = s + \int_{g(s)}^{x} \big( S(x) - S(u) \big)\, c(u)\, m(du) \tag{6.8} \]

for $g(s) < x \le s$ and set $V_g(x,s) = s$ for $x \le g(s)$. Further, define
the first order non-linear differential equation

\[ g'(s) = \frac{\sigma^2(g(s))\, S'(g(s))}{2\, c(g(s))\, \big( S(s) - S(g(s)) \big)}. \tag{6.9} \]

For a solution $s \mapsto g(s) < s$ of equation (6.9) the corresponding
function $V_g(x,s)$ in (6.8) solves the free-boundary problem in the region
$g(s) < x < s$.

The problem now is to choose the right optimal stopping boundary
$s \mapsto g(s) < s$. To do this a new principle is needed, namely the
maximality principle. The main observations in [Peskir, 1998] are the
following.

(i) $g \mapsto V_g(x,s)$ is increasing.

(ii) The function $(x,s,a) \mapsto V_g(x,s) - a$ is superharmonic for the
Markov process $(Z_t) = (X_t, S_t, A_t)$ (for stopping times $\tau$
satisfying (6.3)) where $A_t = \int_0^t c(X_u)\,du + a$.

The superharmonic characterization of the value function in Theorem 1 and
the above two observations lead to the formulation of the following principle
for determining the optimal stopping boundary.

The maximality principle

The optimal stopping boundary $s \mapsto g_*(s)$ for the problem (6.2) is the
maximal solution of the differential equation (6.9) which stays strictly below
the diagonal in $\mathbb{R}^2$ (and is simply called the maximal solution in
the sequel).

3. In [Peskir, 1998] it was proved that this principle is equivalent to the
superharmonic characterization of the value function. The result is formulated
in the next theorem and is motivated by Theorem 1.

THEOREM 3 Consider the optimal stopping problem (6.2).

(I). Let $s \mapsto g_*(s)$ denote the maximal solution of (6.9) which stays
below the diagonal in $\mathbb{R}^2$. Then the value function is given by

\[ V_*(x,s) = \begin{cases}
   s + \int_{g_*(s)}^{x} \big( S(x) - S(u) \big)\, c(u)\, m(du) & \text{for } g_*(s) < x \le s \\
   s & \text{for } x \le g_*(s).
   \end{cases} \]

(II). The stopping time $\tau_* = \inf\{ t > 0 : X_t \le g_*(S_t) \}$ is
optimal whenever it satisfies condition (6.3).

(III). If there exists an optimal stopping time $\sigma$ in (6.2) satisfying
(6.3), then $\tau_* \le \sigma$ $P_{x,s}$-a.s. for all $(x,s)$, and $\tau_*$
is also an optimal stopping time.

(IV). If there is no maximal solution of (6.9) which stays strictly below the
diagonal in $\mathbb{R}^2$, then $V_*(x,s) = \infty$ for all $(x,s)$, and
there is no optimal stopping time.

For more information and details see [Peskir, 1998]. A similar approach
was used in [Pedersen & Peskir, 1998] to compute the expectation of Azéma-Yor
stopping times.
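The maximality principle can be illustrated numerically in the simplest special case. The sketch below assumes, purely for illustration, that $(X_t)$ is a standard Brownian motion (so $S(x) = x$ and $\sigma \equiv 1$) with a constant cost $c(x) = c$; equation (6.9) then reads $g'(s) = 1 / (2c\,(s - g(s)))$ and $g(s) = s - 1/(2c)$ is a solution. The function name `integrate_boundary` and the Euler discretization are choices made here, not part of the text.

```python
def integrate_boundary(c, s0, g0, s_max, n=20000):
    """Euler scheme for g'(s) = 1 / (2 c (s - g)).

    Returns (s, g, hit), where hit is True if the solution reached the
    diagonal x = s before s_max.
    """
    h = (s_max - s0) / n
    s, g = s0, g0
    for _ in range(n):
        gap = s - g
        if gap <= 1e-9:               # the solution has reached the diagonal
            return s, g, True
        g += h / (2.0 * c * gap)      # Euler step for equation (6.9) in this special case
        s += h
    return s, g, False


c = 1.0
on_line = integrate_boundary(c, 0.0, -0.5, 5.0)   # starts on g(s) = s - 1/(2c)
above = integrate_boundary(c, 0.0, -0.4, 5.0)     # starts above that line
below = integrate_boundary(c, 0.0, -0.6, 5.0)     # starts below that line
```

In this special case solutions started above the line $s - 1/(2c)$ run into the diagonal, while solutions started on or below it stay strictly below the diagonal, so the line itself is the maximal solution in the sense of the maximality principle.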

The theorem extends to diffusions with other state spaces in R . The non- 
negative diffusion version of the theorem is particularly interesting to derive 
sharp maximal inequalities, which will be applied in the next example. 

Peskir [Peskir, 1998] conjectured that the maximality principle holds for the 
discounted version of problem (6.2). In Shepp & Shiryaev [Shepp & Shiryaev, 
1993] and Pedersen [Pedersen, 2000a] the principle is shown to hold in spe- 
cific cases. A technical difficulty arises in verifying the conjecture because the 
corresponding free-boundary problem may have no simple solution and the 
(optimal) boundary function is thus implicitly defined. 

EXAMPLE 6 (Doob's inequality for Brownian motion). This example is an
application of the previous example (see also [Graversen & Peskir, 1997]).
Consider the optimal stopping problem (6.2) with $X_t = |B_t + x|^p$ and
$c(x) = c\,x^{(p-2)/p}$ for $p > 1$. Then $(X_t)$ is a non-negative diffusion
having 0 as an instantaneously reflecting boundary point, and the
infinitesimal generator of $(X_t)$ in $(0,\infty)$ is given in (6.1) with
$\mu(x) = \frac{1}{2}\,p(p-1)\,x^{1-2/p}$ and $\sigma^2(x) = p^2 x^{2-2/p}$.
If $c > \frac{1}{2}\,p^{p+1}/(p-1)^{p-1}$, it follows from Theorem 3 that the
value function is given by

\[ V_*(x,s) = s - \frac{2c}{p-1}\, g_*(s)^{1-1/p}\, x^{1/p}
   + \frac{2c}{p}\, g_*(s) + \frac{2c}{p(p-1)}\, x \]

where $s \mapsto g_*(s) < s$ is the maximal solution of the differential
equation

\[ g'(s) = \frac{p\, g(s)^{1/p}}{2c\, \big( s^{1/p} - g(s)^{1/p} \big)}. \tag{6.10} \]

The maximal solution (see Figure 2) can be found to be
$g_*(s) = \alpha_* s$ where $0 < \alpha_* < 1$ is the greater root of the
equation (the maximality principle)

\[ \alpha - \alpha^{1-1/p} + \frac{p}{2c} = 0. \]

The equation admits two roots if and only if
$c > \frac{1}{2}\,p^{p+1}/(p-1)^{p-1}$. Further, the stopping time

\[ \tau_*(c) = \inf\{ t > 0 : X_t \le \alpha_* S_t \} \]

satisfies $E_{x,s}\big( \tau_*(c)^{p/2} \big) < \infty$ if and only if
$c > \frac{1}{2}\,p^{p+1}/(p-1)^{p-1}$. By an extended version of Theorem 3
for non-negative diffusions and an observation in Example 3, it follows by the
definition of the value function for $c > \frac{1}{2}\,p^{p+1}/(p-1)^{p-1}$
that

\[ E_x\Big( \max_{0 \le t \le \tau} |B_t|^p \Big)
   \le \frac{2c}{p(p-1)}\, E_x\big( |B_\tau|^p \big)
   + V_*(x^p, x^p) - \frac{2c}{p(p-1)}\, x^p \]

for all stopping times $\tau$ for $(B_t)$ satisfying
$E_x(\tau^{p/2}) < \infty$. Letting
$c \downarrow \frac{1}{2}\,p^{p+1}/(p-1)^{p-1}$, Doob's inequality follows.

THEOREM 4 Let $(B_t)$ be a standard Brownian motion started at $x$ under
$P_x$ for $x \ge 0$, let $p > 1$ be given and fixed, and let $\tau$ be any
stopping time for $(B_t)$ such that $E_x(\tau^{p/2}) < \infty$. Then the
following inequality is sharp

\[ E_x\Big( \max_{0 \le t \le \tau} |B_t|^p \Big)
   \le \Big( \frac{p}{p-1} \Big)^p E_x\big( |B_\tau|^p \big)
   - \frac{p}{p-1}\, x^p. \]

The constants $(p/(p-1))^p$ and $p/(p-1)$ are the best possible and the
equality is attained through the stopping times
$\tau_* = \inf\{ t > 0 : X_t \le \alpha_* S_t \}$ for
$c \downarrow \frac{1}{2}\,p^{p+1}/(p-1)^{p-1}$.

For details see [Graversen & Peskir, 1997]. The results are extended to
Bessel processes in [Dubins et al, 1993] and [Pedersen, 2000].
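The slope of the linear maximal solution in Example 6 has no closed form for general $p$, but it is easy to compute. The following sketch finds the greater root of the root equation of Example 6 by bisection and evaluates the threshold value of $c$; the function names `critical_cost` and `greater_root` are hypothetical helpers introduced here.

```python
def critical_cost(p):
    """Threshold p^(p+1) / (2 (p-1)^(p-1)) above which the root equation has two roots."""
    return p ** (p + 1) / (2.0 * (p - 1) ** (p - 1))


def greater_root(p, c, tol=1e-12):
    """Greater root in (0, 1) of a - a^(1 - 1/p) + p/(2c) = 0, found by bisection."""
    f = lambda a: a - a ** (1.0 - 1.0 / p) + p / (2.0 * c)
    lo = ((p - 1.0) / p) ** p   # minimiser of f; f(lo) < 0 when c exceeds the threshold
    hi = 1.0                    # f(1) = p / (2c) > 0
    if f(lo) >= 0.0:
        raise ValueError("c must exceed the critical cost")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


alpha = greater_root(2.0, 5.0)   # case p = 2, c = 5
```

At the threshold the factor $2c/(p(p-1))$ equals $(p/(p-1))^p$, which is how the constant of Doob's inequality arises in the limit; for $p = 2$ the threshold is $c = 4$ and the limiting constant is 4.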
Figure 2. A computer drawing of solutions of the differential equation (6.10).
The bold line is the maximal solution which stays below and never hits the
diagonal in $\mathbb{R}^2$. By the maximality principle, this solution equals
$g_*$.



EXAMPLE 7 (Russian option). This is an example of pricing an American option
with infinite time horizon in the framework of the standard Black-Scholes
model. The option under consideration is the Russian option (see
[Shepp & Shiryaev, 1993]). If $(X_t)$ is the price process of a stock then the
payment function of the Russian option is given by

\[ f_t = \max_{0 \le u \le t} X_u \]

where the expiration time is infinity. Thus, it is a perpetual Lookback option
(see [Conze & Viswanathan, 1991]). Assume a standard Black-Scholes model
with a dividend paying stock; under the equivalent martingale measure the
price process is thus the geometric Brownian motion

\[ dX_t = (r - \lambda)\, X_t\,dt + \sigma X_t\,dB_t \]

with $\lambda > 0$ the dividend yield, $r > 0$ the interest rate and
$\sigma > 0$ the volatility. The infinitesimal generator of $(X_t)$ on
$(0,\infty)$ is given in (6.1) with $\mu(x) = (r - \lambda)\,x$ and
$\sigma(x) = \sigma x$.

Under these assumptions, the fair price of the Russian option is, according
to the general pricing theory, the value of the optimal stopping problem

\[ C_*(x,s) = \sup_{\tau} E_{x,s}\big( e^{-r\tau} f_\tau \big)
            = \sup_{\tau} E_{x,s}\big( e^{-r\tau} S_\tau \big) \]

where the supremum is taken over all stopping times $\tau$ for $(X_t)$. To
solve this problem, the idea is to apply Example 3 and the maximality
principle for this discounted optimal stopping problem. The recipe from the
previous section is applied to solve the problem.

1. As in Example 5, and using an observation in Example 3, it is natural to
formulate the following free-boundary problem, of which the value function and
the optimal stopping boundary are a solution:

\[ \mathbb{L}_X C(x,s) = r\, C(x,s) \quad \text{for } g(s) < x < s \]

\[ \frac{\partial C}{\partial s}(x,s)\Big|_{x = s-} = 0 \quad \text{(normal reflection)} \]

\[ C(x,s)\Big|_{x = g(s)+} = s \quad \text{(instantaneous stopping)} \]

\[ \frac{\partial C}{\partial x}(x,s)\Big|_{x = g(s)+} = 0 \quad \text{(smooth fit)}. \]

Since the setting is smooth, the principle of smooth fit should be satisfied.
The theorem below shows that this system is indeed correct.

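2. Let $\gamma_1 < 0$ and $\gamma_2 > 1$ be the two roots of the quadratic
equation

\[ \tfrac{1}{2}\sigma^2 \gamma^2 + \big( r - \lambda - \tfrac{1}{2}\sigma^2 \big)\gamma - r = 0 \]

and set

\[ \beta_* = \Big( \frac{1 - 1/\gamma_2}{1 - 1/\gamma_1} \Big)^{1/(\gamma_2 - \gamma_1)} < 1. \]

The solutions to the free-boundary problem are

\[ C(x,s) = \frac{s}{\gamma_2 - \gamma_1}
   \Big( \gamma_2 \Big( \frac{x}{g(s)} \Big)^{\gamma_1}
       - \gamma_1 \Big( \frac{x}{g(s)} \Big)^{\gamma_2} \Big) \]

where $s \mapsto g(s)$ satisfies the nonlinear differential equation

\[ g'(s) = \frac{ \dfrac{1}{\gamma_2} \Big( \dfrac{s}{g(s)} \Big)^{\gamma_2}
                - \dfrac{1}{\gamma_1} \Big( \dfrac{s}{g(s)} \Big)^{\gamma_1} }
               { \Big( \dfrac{s}{g(s)} \Big)^{\gamma_2 + 1}
                - \Big( \dfrac{s}{g(s)} \Big)^{\gamma_1 + 1} }. \]

The maximality principle says that the maximal solution of the differential
equation is the optimal stopping boundary. It can be shown that
$g_*(s) = \beta_* s$ is the one.

3. The standard procedure of applying Itô's formula, Fatou's lemma etc. can
be used to verify that the estimated candidates are indeed correct. The result
on the fair price of the Russian option is stated below.

THEOREM 5 The fair price of the Russian option is given by

\[ C_*(x,s) = \begin{cases}
   \dfrac{s}{\gamma_2 - \gamma_1}
   \Big( \gamma_2 \Big( \dfrac{x}{\beta_* s} \Big)^{\gamma_1}
       - \gamma_1 \Big( \dfrac{x}{\beta_* s} \Big)^{\gamma_2} \Big)
     & \text{for } \beta_* s < x \le s \\
   s & \text{for } 0 < x \le \beta_* s
   \end{cases} \]

and the optimal stopping time is given by

\[ \tau_* = \inf\{ t > 0 : X_t \le \beta_* S_t \}. \]

The fair price of the Russian option was calculated in [Shepp & Shiryaev,
1993], which should also be consulted for more information and details. The
result is extended in [Pedersen, 2000a] to Lookback options with fixed and
floating strike.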
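The quantities in Theorem 5 are straightforward to evaluate. The sketch below computes the two roots of the quadratic equation, the boundary slope, and the resulting price; the function names `russian_option` and `price` and the parameter values are illustrative assumptions, and the stopping region is taken to be $x \le \beta_* s$, where the price equals $s$ by instantaneous stopping.

```python
import math


def russian_option(r, lam, sigma):
    """Return (gamma1, gamma2, beta) for 0.5 sigma^2 g^2 + (r - lam - 0.5 sigma^2) g - r = 0."""
    a = 0.5 * sigma ** 2
    b = r - lam - 0.5 * sigma ** 2
    disc = math.sqrt(b * b + 4.0 * a * r)
    gamma1 = (-b - disc) / (2.0 * a)     # negative root
    gamma2 = (-b + disc) / (2.0 * a)     # root larger than 1 when lam > 0
    beta = ((1.0 - 1.0 / gamma2) / (1.0 - 1.0 / gamma1)) ** (1.0 / (gamma2 - gamma1))
    return gamma1, gamma2, beta


def price(x, s, r, lam, sigma):
    """Price of the Russian option at stock price x and running maximum s >= x."""
    g1, g2, beta = russian_option(r, lam, sigma)
    if x <= beta * s:                    # stopping region: exercise immediately
        return s
    y = x / (beta * s)
    return s / (g2 - g1) * (g2 * y ** g1 - g1 * y ** g2)


params = (0.05, 0.02, 0.3)               # hypothetical r, lambda, sigma
g1, g2, beta = russian_option(*params)
at_the_maximum = price(1.0, 1.0, *params)
```

A finite-difference check of the derivative in $x$ at $x = \beta_* s$ and of the derivative in $s$ at $x = s$ gives values close to zero, in agreement with the smooth-fit and normal-reflection conditions.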

EXAMPLE 8 (Optimal prediction of the ultimate maximum of Brownian motion).
This example presents solutions to the problem of stopping a Brownian path as
close as possible to the unknown ultimate maximum height of the path. The
closeness is first measured by a mean-square distance and next by a
probability distance. The optimal stopping strategies can also be viewed as
selling strategies for stock trading in the idealized Bachelier model. These
problems do not fall under the general optimal stopping theory, since the gain
process is not adapted to the natural filtration of the process.
In this example the diffusion is $X_t = B_t$. Let

\[ \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\,du \]

for $x \in \mathbb{R}$ denote the distribution function of a standard normal
variable. Let $\mathcal{S}_t$ be the family of all stopping times $\tau$ for
$(B_t)$ satisfying $\tau \le t$.

Mean-square distance

This problem was formulated and solved in [Graversen, Peskir & Shiryaev,
2001], and in [Pedersen, 2003] the problem is solved for all $L^q$-distances.
Consider the optimal stopping problem with value function

\[ V_* = \inf_{\tau \in \mathcal{S}_1} E\big( (S_1 - B_\tau)^2 \big). \tag{6.11} \]

The idea is to transform problem (6.11) into an equivalent problem that can be
solved by the recipe presented in the previous section.

To follow the above plan, note that $S_1$ is square integrable; then in
accordance with the Itô-Clark representation theorem

\[ S_1 = E(S_1) + \int_0^1 H_u\,dB_u \]

where $(H_t)$ is a unique adapted process satisfying
$E\big( \int_0^1 H_u^2\,du \big) < \infty$. Furthermore, it is known that

\[ H_t = 2\Big( 1 - \Phi\Big( \frac{S_t - B_t}{\sqrt{1-t}} \Big) \Big). \]

If $(M_t)$ denotes the square integrable martingale
$M_t = \int_0^t H_u\,dB_u$, then martingale theory gives that
$E\big( (S_1 - B_\tau)^2 \big) = E\big( \int_0^\tau (1 - 2H_u)\,du \big) + 1$
for all $\tau \in \mathcal{S}_1$. Problem (6.11) can therefore be represented
as

\[ V_* = \inf_{\tau \in \mathcal{S}_1}
   E\Big( \int_0^\tau c\Big( \frac{S_u - B_u}{\sqrt{1-u}} \Big)\,du \Big) + 1 \]

where $c(x) = 4\Phi(x) - 3$. By Lévy's theorem and general optimal stopping
theory, the problem (6.11) is equivalent to

\[ V_* = \inf_{\tau \in \mathcal{S}_1}
   E\Big( \int_0^\tau c\Big( \frac{|B_u|}{\sqrt{1-u}} \Big)\,du \Big) + 1. \]

The form of the gain function indicates that the deterministic time-change
method introduced in Example 4 can be applied successfully. Let
$\sigma_t = 1 - e^{-2t}$ be the time-change and let $(Z_t)_{t \ge 0}$ be the
time-changed process given by $Z_t = B_{\sigma_t} / \sqrt{1 - \sigma_t}$. It
can be shown by Itô's formula that $(Z_t)$ solves the stochastic differential
equation

\[ dZ_t = Z_t\,dt + \sqrt{2}\,d\beta_t \]

where $(\beta_t)_{t \ge 0}$ is a Brownian motion. Hence $(Z_t)$ is a diffusion
with the infinitesimal generator

\[ \mathbb{L}_Z = z\,\frac{d}{dz} + \frac{d^2}{dz^2} \]

for $z \in \mathbb{R}$. Substituting the time-change (note that
$d\sigma_u = 2 e^{-2u}\,du$) yields that

\[ V_* = 2 \inf_{\sigma} E\Big( \int_0^{\sigma} e^{-2u}\, c(|Z_u|)\,du \Big) + 1. \]

Hence the initial problem (6.11) reduces to solving

\[ W_*(z) = \inf_{\sigma} E_z\Big( \int_0^{\sigma} e^{-2u}\, c(|Z_u|)\,du \Big) \tag{6.12} \]

where the infimum is taken over all stopping times $\sigma$ for $(Z_t)$ and
$V_* = 2 W_*(0) + 1$. This is a problem that can be solved with the recipe
from Section 18.5.

1. The domain of continued observation is a symmetric interval around zero,
that is $C = \{ z \in \mathbb{R} \mid z \in (-z_*, z_*) \}$, and the value
function is an even smooth function, or equivalently $W_*'(0) = 0$. From the
observation in Example 3 one is led to formulate the corresponding
free-boundary system of the problem (6.12)

\[ \mathbb{L}_Z W(z) - 2 W(z) = -c(|z|) \quad \text{for } -z_* < z < z_* \]

\[ W(\pm z_*) = 0 \quad \text{(instantaneous stopping)} \]

\[ W'(\pm z_*) = 0 \quad \text{(smooth fit)} \]

\[ W'(0) = 0 \quad \text{(normal reflection)}. \]

2. The solution of the free-boundary problem is given by

\[ W(z) = \Phi(z_*)\,(1 + z^2) - z\,\Phi'(z) + (1 - z^2)\,\Phi(z) - 3/2 \]

for $z \in [0, z_*]$ where $z_*$ is the unique solution of the equation (6.13).

3. By Itô's formula it can be proved that $W(z)$ is the value function and
$\sigma_* = \inf\{ t > 0 : |Z_t| \ge z_* \}$ is an optimal stopping time.
Transforming the value function and the optimal strategy back to the initial
problem (6.11) the following result ensues (for more details see
[Graversen, Peskir & Shiryaev, 2001]).

THEOREM 6 Consider the optimal stopping problem (6.11). Then the value
function $V_*$ is given by

\[ V_* = 2\Phi(z_*) - 1 \approx 0.73 \]

where $z_* \approx 1.12$ is the unique root of the equation

\[ 4\Phi(z_*) - 2 z_* \Phi'(z_*) - 3 = 0. \tag{6.13} \]

The following stopping time is optimal (see Figure 3)

\[ \tau_* = \inf\Big\{ t > 0 : \max_{0 \le u \le t} B_u - B_t \ge z_* \sqrt{1-t} \Big\}. \tag{6.14} \]

Figure 3. A computer drawing of the optimal stopping strategy (6.14).
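The constants in Theorem 6 can be reproduced with a few lines of code. The following sketch solves equation (6.13) by bisection, using the error function for $\Phi$; the function names and the bracketing interval $[0.5, 2]$ are choices made here for illustration.

```python
import math


def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def phi(x):
    """Standard normal density, the derivative of Phi."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def smooth_fit_root(lo=0.5, hi=2.0, tol=1e-12):
    """Root of 4 Phi(z) - 2 z Phi'(z) - 3 = 0; the left side is increasing in z."""
    f = lambda z: 4.0 * Phi(z) - 2.0 * z * phi(z) - 3.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def W(z, z_star):
    """Candidate solution of the free-boundary system on [0, z_star]."""
    return Phi(z_star) * (1.0 + z * z) - z * phi(z) + (1.0 - z * z) * Phi(z) - 1.5


z_star = smooth_fit_root()
v_star = 2.0 * Phi(z_star) - 1.0
```

The computed root and value agree with the rounded figures $z_* \approx 1.12$ and $V_* \approx 0.73$ quoted in the theorem, and `W(z_star, z_star)` vanishes, as the instantaneous-stopping condition requires.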
Probability distance

The problem was formulated and solved in [Pedersen, 2003]. Consider the
optimal stopping problem with value function

\[ W_*^{(\varepsilon)} = \sup_{\tau \in \mathcal{S}_1} P( S_1 - B_\tau \le \varepsilon ) \tag{6.15} \]

for $\varepsilon > 0$. Furthermore, in this case, the gain process is
discontinuous. Using the stationary independent increments of $(B_t)$ yields
that

\[ W_*^{(\varepsilon)}
   = \sup_{\tau \in \mathcal{S}_1} E\Big( E\big( 1_{[0,\varepsilon]}(S_1 - B_\tau) \mid \mathcal{F}_\tau \big) \Big)
   = \sup_{\tau \in \mathcal{S}_1} E\big( F_{1-\tau}(\varepsilon) \,;\, S_\tau - B_\tau \le \varepsilon \big) \]

where $F_t(\varepsilon) = 2\Phi(\varepsilon/\sqrt{t}) - 1$ is the distribution
function of $S_t$. By Lévy's theorem and the general optimal stopping theory
the stopping problem (6.15) is equivalent to solving

\[ W_*^{(\varepsilon)}(t,x) = \sup_{\tau \in \mathcal{S}_{1-t}}
   E_x\big( F_{1-t-\tau}(\varepsilon) \,;\, |B_\tau| \le \varepsilon \big) \]

for $t \le 1$ and $x \in \mathbb{R}$. It can be shown that it is only optimal
to stop if $|B_\tau| = \varepsilon$ on the set $\{ \tau < 1 - t \}$. This
observation, together with the Brownian scaling property, indicates that the
optimal stopping time is of the form

\[ \tau_* = \inf\{ 0 \le u \le 1 - t : |B_u| = \varepsilon\, b_*(t + u) \}
   \qquad (\inf \emptyset = 1 - t) \]

where the boundary function $b_*(t) = \infty$ if $t < t_*$ and $b_*(t) = 1$
elsewhere for some $0 \le t_* < 1$ to be found. This shows that the principle
of smooth fit is not satisfied in the sense that the value function
$W_*^{(\varepsilon)}$ is not $C^1$ at all points of the boundary of the domain
of continued observation. More precisely, the smooth fit breaks down in the
state variable $x$ because of the discontinuous gain function. However, due to
the definition of the gain function the smooth fit should still hold in the
time variable $t$ and this implies, together with Itô's formula and the shape
of the domain of continued observation, that the principle of smooth fit at a
single point should hold. This approach provides a method to determine $t_*$.

Figure 4. A computer drawing of the optimal stopping strategy (6.16) when
$\varepsilon = 0.75$ and $\varepsilon = 1.12$. Then $t_* = 0.59$ and
$t_* = 0.09$ respectively.
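Set

\[ W^{(\varepsilon)}(t,x)
   = E_x\big( F_{1-t-\tau_\varepsilon \wedge (1-t)}(\varepsilon) \,;\,
              |B_{\tau_\varepsilon \wedge (1-t)}| \le \varepsilon \big)
   = E_x\big( F_{1-t-\tau_\varepsilon}(\varepsilon) \,;\, \tau_\varepsilon < 1 - t \big)
     + P_x( \tau_\varepsilon \ge 1 - t )\, 1_{[0,\varepsilon]}(|x|) \]

where $\tau_\varepsilon = \inf\{ u \ge 0 : |B_u| = \varepsilon \}$ is the
stopping time corresponding to the boundary $b \equiv 1$. For fixed $t < 1$,
$x \mapsto W^{(\varepsilon)}(t,x)$ is in general only continuous at
$|x| = \varepsilon$. Let $\varepsilon_* \approx 1.17$ be the point satisfying
that $x \mapsto W^{(\varepsilon_*)}(0,x)$ is differentiable at
$|x| = \varepsilon_*$. The result is the following theorem.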

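THEOREM 7 Consider the optimal stopping problem (6.15). Set

\[ t_* = \big( 1 - (\varepsilon/\varepsilon_*)^2 \big) \vee 0. \]

(i) If $t_* = 0$, then the value function is given by (see Figure 5)

\[ W_*^{(\varepsilon)} = W^{(\varepsilon)}(0,0)
   = 1 + 2 \int_0^1 F_{1-y}(\varepsilon) \sum_{k=0}^{\infty} (-1)^k\,
       \frac{(2k+1)\varepsilon}{y^{3/2}}\,
       \varphi\Big( \frac{(2k+1)\varepsilon}{\sqrt{y}} \Big)\,dy
   - 4 \sum_{k=0}^{\infty} (-1)^k\, \Phi\big( -(2k+1)\varepsilon \big). \]

(ii) If $t_* > 0$, then the value function is given by (see Figure 5)

\[ W_*^{(\varepsilon)} = W_*^{(\varepsilon)}(0,0)
   = \frac{1}{\sqrt{t_*}} \int_{-\infty}^{\infty} W^{(\varepsilon)}(t_*, x)\,
     \varphi\Big( \frac{x}{\sqrt{t_*}} \Big)\,dx \]

where $\varphi = \Phi'$ is the standard normal density,

\[ W^{(\varepsilon)}(t,x)
   = 1 + \int_0^{1-t} F_{1-t-y}(\varepsilon) \sum_{k=-\infty}^{\infty} (-1)^k\,
       \frac{x + (2k+1)\varepsilon}{y^{3/2}}\,
       \varphi\Big( \frac{x + (2k+1)\varepsilon}{\sqrt{y}} \Big)\,dy
   - 2 \sum_{k=-\infty}^{\infty} (-1)^k\,
       \mathrm{sgn}\big( x + (2k+1)\varepsilon \big)\,
       \Phi\Big( -\frac{|x + (2k+1)\varepsilon|}{\sqrt{1-t}} \Big) \]

for $0 \le x \le \varepsilon$ and

\[ W^{(\varepsilon)}(t,x)
   = \int_0^{1-t} F_{1-t-y}(\varepsilon)\,
     \frac{x - \varepsilon}{y^{3/2}}\,
     \varphi\Big( \frac{x - \varepsilon}{\sqrt{y}} \Big)\,dy \]

for $x > \varepsilon$.

In both cases, the optimal stopping time is given by (see Figure 4)

\[ \tau_* = \inf\Big\{ t_* \le t \le 1 : \max_{0 \le u \le t} B_u - B_t = \varepsilon \Big\}
   \qquad (\inf \emptyset = 1). \tag{6.16} \]

Figure 5. A drawing of the value function $W_*^{(\varepsilon)}$ as a function
of $\varepsilon$.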
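The stopping rule (6.16) is simple to apply to a simulated path. The sketch below computes $t_*$ from $\varepsilon$ and runs the rule on a discretized Brownian path. It assumes the rounded value $\varepsilon_* \approx 1.17$ quoted above, and it replaces the exact hitting condition by a level crossing between consecutive grid points, so it is only a discrete approximation; the function names are illustrative.

```python
import math
import random

EPS_STAR = 1.17  # rounded value quoted in the text


def start_time(eps, eps_star=EPS_STAR):
    """t_* = (1 - (eps / eps_star)^2) v 0."""
    return max(1.0 - (eps / eps_star) ** 2, 0.0)


def simulate_stopping_time(eps, n_steps=10000, seed=0):
    """First grid time t >= t_* at which S_t - B_t reaches the level eps, else 1."""
    rng = random.Random(seed)
    dt = 1.0 / n_steps
    t_star = start_time(eps)
    b, s = 0.0, 0.0
    prev_diff = None
    for k in range(n_steps + 1):
        if k > 0:
            b += rng.gauss(0.0, math.sqrt(dt))
            s = max(s, b)
        t = k / n_steps
        diff = (s - b) - eps
        if t >= t_star:
            # stop when the gap equals eps or crosses it between two grid points
            if diff == 0.0 or (prev_diff is not None and prev_diff * diff < 0.0):
                return t
            prev_diff = diff
    return 1.0


tau = simulate_stopping_time(0.75)
```

For $\varepsilon = 0.75$ the function `start_time` returns a value close to 0.59, in line with the caption of Figure 4, and for any $\varepsilon \ge \varepsilon_*$ it returns 0.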
19.1 Introduction

The universality of critical phenomena in phase transitions has attracted
physicists for more than 25 years [Stanley, 1971]. Soon afterwards its
relevance for epidemiological and, more generally, birth-death processes was
also recognized ([Grassberger & de la Torre, 1979], [Grassberger, 1983]). For
a recent popular account of universality and its applications in various
scientific fields see [Warden, 2001].

Two case studies will be presented to demonstrate the various aspects of
criticality in epidemiology. In our first case study we will show how an
epidemic system can display huge variability while crossing a critical
threshold: measles under decreasing vaccination levels, caused by a loss of
confidence in vaccines in an originally highly vaccinated population (e.g. due
to ongoing discussions on vaccine side effects, especially the combined
measles, mumps and rubella vaccine MMR claimed to cause autism, as discussed
in Great Britain).

Not only criticality as such but development of a system towards this crit- 
icality has been postulated for physical systems ([Bak et al, 1987], [Bak et 
al, 1988]) with the paradigmatic system of a sand pile (see for an overview 
[Jensen, 1998]). 

In our second case study we present a system consisting of host classes
infected with different mutants of a pathogenic agent leading the epidemic
system towards criticality: bacterial meningitis. This system is of much
broader interest, since it potentially provides an explanation for
uncertainties and huge fluctuations for more general models in evolutionary
biology. This approach is more realistic than previous attempts in
oversimplified evolutionary models ([Bak & Sneppen, 1993], [Flyvbjerg et al,
1993]). We show explicitly that a parameter is automatically driven towards
its critical value. The pathogenicity evolves to small values near its
critical value of zero. In the analysis it evolves to zero, since for analytic
tractability we use reasonable approximations that show correct qualitative
behaviour. In the full system the pathogenicity will evolve to small values,
of the order of magnitude of the mutation rate, where competing strains can
replace each other.

Epidemics with critical fluctuations have been described in the literature
before ([Rhodes & Anderson, 1996], [Rhodes et al, 1997]) in forest-fire-like
scenarios ([Jensen, 1998], p. 68). We present a non-spatial stochastic model,
specifically a master equation (time-continuous Markov process), leading at
criticality to power laws with exponents of mean field type (essentially the
branching process exponent 3/2), confirming that the system under
investigation really exhibits critical fluctuations with fat tail behaviour.

A spatial system analysis would require a renormalization approach to path
integrals which are derived from the spatial master equation. This method is
still under controversial debate, even in the analysis of chemical systems
([Cardy, 1996], [Wijland, 2001], [Park et al, 2000]), and can only be
sketched here.

19.2 Basic epidemiological model 

In this section we describe the basic epidemiological model which, with
modifications, underlies the following sections. It describes a non-spatial
homogeneous mixing population of hosts in different states of infection. A
corresponding spatial model will be given and analyzed in the final sections.

Since we will describe fluctuations near critical states we have to consider 
stochastic models, Markov processes explicitly formulated in master equa- 
tions, as used in physics and chemistry (see e.g. [van Kampen, 1992]). 

19.2.1. The SIR-model 

The basic SIR-model for a host population of size $N$ divided into subclasses
of susceptible, infected and recovered hosts [Anderson & May, 1991] is
constructed as follows: With a rate $\alpha$ a resistant host becomes
susceptible, or as a reaction scheme $R \to S$. Then, susceptibles meet
infected hosts with a transition rate $\beta$, proportional to the number of
infected (divided by $N$ to make the model scale invariant with population
size, since we obtain a quadratic term in the variables, as opposed to the
linear term in the previous transition). As a reaction scheme we have
$S + I \to I + I$. Finally, infected hosts can recover and become temporarily
resistant with rate $\gamma$, hence $I \to R$.



We could call this basic SIR-model also SIRS-model, since transitions from R to S are allowed, but stick 
to SIR, since later in an SIRYX-model, with additional classes of hosts to be introduced later, parallel 
transitions prohibit a simple way of labelling. Hence, here SIR just means that we have three classes of 
hosts, S, I and R to deal with, as opposed to 5 classes in the more complicated model. 
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The corresponding deterministic ordinary differential equation (ODE) sys- 
tem reads 

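$$
\begin{aligned}
\frac{dS}{dt} &= \alpha \cdot R - \beta \frac{I}{N} \cdot S \\
\frac{dI}{dt} &= \beta \frac{I}{N} \cdot S - \gamma \cdot I \\
\frac{dR}{dt} &= \gamma \cdot I - \alpha \cdot R
\end{aligned}
\tag{2.1}
$$

and describes merely the dynamics of the mean values for the total number
of susceptibles, infected and recovered under the assumptions of mean field
behaviour and homogeneous mixing, hence mean values of products can be
replaced by products of means in the nonlinear contact term $(\beta/N)\, I \cdot S$.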
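As a quick numerical illustration of the mean field ODE system, the following sketch integrates it with a classical Runge-Kutta step in plain Python. The function names, the step size and the parameter values ($\alpha = 0.1$, $\beta = 0.2$, $\gamma = 0.1$, $N = 1000$) are assumptions chosen for illustration only; they are not prescribed at this point of the text.

```python
def sir_rhs(state, alpha, beta, gamma, n_total):
    """Right hand side of the mean field SIR system with loss of immunity."""
    s, i, r = state
    infection = beta * i / n_total * s      # nonlinear contact term
    return (alpha * r - infection,          # dS/dt
            infection - gamma * i,          # dI/dt
            gamma * i - alpha * r)          # dR/dt


def rk4_step(state, dt, *params):
    """Advance the state by one classical fourth order Runge-Kutta step."""
    def shifted(base, slope, h):
        return tuple(b + h * k for b, k in zip(base, slope))

    k1 = sir_rhs(state, *params)
    k2 = sir_rhs(shifted(state, k1, dt / 2), *params)
    k3 = sir_rhs(shifted(state, k2, dt / 2), *params)
    k4 = sir_rhs(shifted(state, k3, dt), *params)
    return tuple(x + dt / 6 * (a + 2 * b + 2 * c + d)
                 for x, a, b, c, d in zip(state, k1, k2, k3, k4))


def integrate_sir(state, t_end, dt, alpha, beta, gamma, n_total):
    """Integrate from t = 0 to t_end and return the final (S, I, R)."""
    for _ in range(int(round(t_end / dt))):
        state = rk4_step(state, dt, alpha, beta, gamma, n_total)
    return state


# Illustrative run: 10 infected introduced into 990 susceptibles.
final_state = integrate_sir((990.0, 10.0, 0.0), 500.0, 0.1,
                            alpha=0.1, beta=0.2, gamma=0.1, n_total=1000.0)
```

For these illustrative values the trajectory settles at the point where all three right hand sides vanish, $S = 500$, $I = 250$, $R = 250$, while $S + I + R$ stays at $N$ throughout.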

19.2.2. Stochastic modelling 

We include demographic stochasticity into the description of the epidemic.
As such, for the basic SIR-model we consider the dynamics of the probability
$p(S, I, R, t)$ of the system to have $S$ susceptibles, $I$ infected and $R$
recovered at time $t$, which is governed by a master equation ([van Kampen,
1992], [Gardiner, 1985], and in a recent application to a plant epidemic model
[Stollenwerk & Briggs, 2000], [Stollenwerk, 2001]). For state vectors $n$, here
for the SIR-model $n = (S, I, R)$, the master equation reads

$$
\frac{dp(n)}{dt} = \sum_{\tilde n \neq n} w_{n,\tilde n}\, p(\tilde n)
 - \sum_{\tilde n \neq n} w_{\tilde n,n}\, p(n)
\tag{2.2}
$$

with transition probabilities corresponding to the ones described above for the
ODE-system. Here the rates $w_{\tilde n,n}$ are

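$$
\begin{aligned}
w_{(S+1,I,R-1),(S,I,R)} &= \alpha \cdot R \\
w_{(S-1,I+1,R),(S,I,R)} &= \beta \cdot \frac{I}{N} \cdot S \\
w_{(S,I-1,R+1),(S,I,R)} &= \gamma \cdot I
\end{aligned}
\tag{2.3}
$$

from which the rates $w_{n,\tilde n}$ follow immediately as

$$
\begin{aligned}
w_{(S,I,R),(S-1,I,R+1)} &= \alpha \cdot (R+1) \\
w_{(S,I,R),(S+1,I-1,R)} &= \beta \cdot \frac{I-1}{N} \cdot (S+1) \\
w_{(S,I,R),(S,I+1,R-1)} &= \gamma \cdot (I+1) \, .
\end{aligned}
\tag{2.4}
$$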

This formulation defines the stochastic process completely and will be the basis 
for modified models, e.g. additional terms for vaccination in the next section. 
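The step from the loss rates out of a state to the gain rates into it can be made explicit in a few lines: a gain rate is simply the corresponding loss rate evaluated at the neighbouring source state. The sketch below assumes the three SIR transitions described above; the function names and the numerical values in the usage line are illustrative assumptions.

```python
def forward_rates(state, alpha, beta, gamma, n_total):
    """Loss rates out of state n = (S, I, R), keyed by the target state."""
    s, i, r = state
    return {
        (s + 1, i, r - 1): alpha * r,               # R -> S
        (s - 1, i + 1, r): beta * i / n_total * s,  # S + I -> I + I
        (s, i - 1, r + 1): gamma * i,               # I -> R
    }


def gain_rates(state, alpha, beta, gamma, n_total):
    """Gain rates into state n, keyed by the source state they come from."""
    s, i, r = state
    sources = [(s - 1, i, r + 1), (s + 1, i - 1, r), (s, i + 1, r - 1)]
    gains = {}
    for source in sources:
        if min(source) < 0:
            continue  # no such state, hence no probability flux from it
        gains[source] = forward_rates(source, alpha, beta, gamma, n_total)[state]
    return gains


# Illustrative state and parameters (assumed values).
example_gains = gain_rates((50, 20, 30), alpha=0.1, beta=0.2, gamma=0.1, n_total=100)
```

Keying the dictionaries by state keeps the bookkeeping of "from" and "to" explicit, which is the part that is easy to get wrong when writing the master equation by hand.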
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19.3 Measles around criticality 

Measles epidemics in human populations have been a subject of investiga- 
tions for a long time ([London & Yorke, 1973], [London & Yorke, 1973a], 
[Dietz, 1976]), since rather good empirical time series are available, and var- 
ious aspects of recent paradigmatic theories like deterministic chaos in pre- 
vaccination dynamics ([Schwartz & Smith, 1983], [Schenzle, 1984], [Aron & 
Schwartz, 1984], [Schaffer, 1985], [Schaffer & Kott, 1985], [Olsen & Schaf- 
fer, 1990], [May & Sugihara, 1990], [Rand & Wilson, 1991], [Grenfell, 1992], 
[Bolker & Grenfell, 1993], [Drepper et al, 1994]) and criticality in island pop- 
ulations have been investigated ([Rhodes & Anderson, 1996], [Rhodes et al, 
1997]). 

Here we investigate a vaccinated population in which the vaccination level
drops below the critical threshold. Above the threshold the only stable
stationary state is the disease-free population, and any invading disease
cases lead to epidemics which quickly go extinct; below it epidemics can take
off. The consideration of dropping vaccination levels is motivated by the
observation that in the United Kingdom a discussion on side-effects of
vaccines led to a dramatic drop in vaccine uptake [Jansen et al, 2002].

19.3.1. The ODE system for the SIR-model with 
vaccination 

The ODE system for the SIR-model with vaccination reads

$$
\begin{aligned}
\dot S &= \mu (N - S) - \beta \frac{I}{N} S - v S \\
\dot I &= \beta \frac{I}{N} S - (\gamma + \mu) I \\
\dot R &= \gamma I - \mu R + v S
\end{aligned}
\tag{3.1}
$$

with $v := a \cdot \rho$ the vaccination rate. Here $a$ is a time rate for the
vaccination and $\rho$ the proportion of vaccinated susceptibles. Only the
product of both has importance in the model.

19.3.2. Stationary state and vaccination threshold 

From Equ. (3.1), defining functions $f$, $g$ and $h$ as

$$
\begin{aligned}
\dot S &= f(S,I,R) \\
\dot I &= g(S,I,R) \\
\dot R &= h(S,I,R)
\end{aligned}
\tag{3.2}
$$

we obtain the stationary state by the conditions $f(S^*,I^*,R^*) = 0$,
$g(S^*,I^*,R^*) = 0$ and $h(S^*,I^*,R^*) = 0$. Since we have quadratic terms
we find two equilibria.

In the stationary state $I^*_1 = 0$ (no epidemics) we find

$$
S^*_1 = N \frac{\mu}{\mu + v} , \qquad
R^*_1 = N - S^*_1 - I^*_1 = N \frac{v}{\mu + v} .
\tag{3.3}
$$



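Stability analysis gives the condition for the vaccination threshold. The
Jacobian matrix around the stationary state $x := (S^*, I^*, R^*)$ is given by

$$
\frac{d\underline{f}}{d\underline{x}} =
\begin{pmatrix}
\frac{\partial f}{\partial S} & \frac{\partial f}{\partial I} & \frac{\partial f}{\partial R} \\
\frac{\partial g}{\partial S} & \frac{\partial g}{\partial I} & \frac{\partial g}{\partial R} \\
\frac{\partial h}{\partial S} & \frac{\partial h}{\partial I} & \frac{\partial h}{\partial R}
\end{pmatrix}
\tag{3.4}
$$

hence

$$
\frac{d\underline{f}}{d\underline{x}} =
\begin{pmatrix}
-\mu - \beta \frac{I^*}{N} - v & -\beta \frac{S^*}{N} & 0 \\
\beta \frac{I^*}{N} & \beta \frac{S^*}{N} - (\gamma + \mu) & 0 \\
v & \gamma & -\mu
\end{pmatrix} .
\tag{3.5}
$$

At the disease-free state, where $I^*_1 = 0$ and the off-diagonal product
vanishes, the characteristic polynomial is given by

$$
\left( -\mu - \beta \frac{I^*_1}{N} - v - \lambda \right)
\left( \beta \frac{S^*_1}{N} - (\gamma + \mu) - \lambda \right)
\left( -\mu - \lambda \right) .
\tag{3.6}
$$

One eigenvalue is simply $\lambda_3 = -\mu$, and after some calculation two
further eigenvalues are $\lambda_2 = -(\mu + v)$ and

$$
\lambda_1 = \beta \frac{S^*_1}{N} - (\gamma + \mu)
\tag{3.7}
$$

those two being interesting for the further considerations. The requirement
$\lambda_1 = 0$ gives the threshold value $v_c$, or critical vaccination value,

$$
v_c = \frac{\mu}{\gamma + \mu} \left( \beta - (\gamma + \mu) \right) .
\tag{3.8}
$$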
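The threshold condition can be checked numerically: inserting $S^*_1/N = \mu/(\mu + v)$ into $\lambda_1$ and evaluating at $v_c$ should give zero, with $\lambda_1$ negative above and positive below the threshold. The sketch below is a minimal check under that reading of Equs. (3.3), (3.7) and (3.8); the function names and the sample parameter values are assumptions for illustration.

```python
def disease_free_eigenvalues(mu, beta, gamma, v):
    """Eigenvalues of the Jacobian at the disease-free state (I* = 0)."""
    s_over_n = mu / (mu + v)                  # S*/N of the disease-free state
    lambda_1 = beta * s_over_n - (gamma + mu)
    lambda_2 = -(mu + v)
    lambda_3 = -mu
    return lambda_1, lambda_2, lambda_3


def critical_vaccination_rate(mu, beta, gamma):
    """Vaccination rate v_c at which lambda_1 changes sign."""
    return mu / (gamma + mu) * (beta - (gamma + mu))


# Illustrative values in units of 1/year (assumed for this example).
mu_example, beta_example, gamma_example = 1.0 / 75.0, 750.0, 50.0
v_c_example = critical_vaccination_rate(mu_example, beta_example, gamma_example)
```

Evaluating `disease_free_eigenvalues` slightly above and below `v_c_example` shows the sign change of $\lambda_1$ that marks the loss of stability of the disease-free state.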
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19.3.3. Definition and expression for the reproduction 
number $\mathcal{R}$

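In the endemic stationary state $I^*_2 \neq 0$, where the disease is always
present, we find

$$
S^*_2 = N \frac{\gamma + \mu}{\beta} , \qquad
I^*_2 = N \left( \frac{\mu}{\gamma + \mu} - \frac{\mu + v}{\beta} \right) , \qquad
R^*_2 = N - S^*_2 - I^*_2 .
\tag{3.9}
$$

With the heuristic definition of the reproduction level, called $\mathcal{R}$,
measured in stationarity

$$
\mathcal{R} \cdot \frac{S^*_2}{N} := 1
\tag{3.10}
$$

we obtain

$$
\frac{S^*_2}{N} = \frac{1}{\mathcal{R}} = \frac{\gamma + \mu}{\beta} .
\tag{3.11}
$$

Then the critical vaccination threshold can be expressed as a function of
$\mathcal{R}$:

$$
v_c = \frac{\mu}{\gamma + \mu} \left( \beta - (\gamma + \mu) \right) = \mu \mathcal{R} - \mu .
\tag{3.12}
$$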

19.3.4. Vaccination level at criticality $v_c$

At the criticality threshold $v_c$ we obtain the classical result for the
vaccination threshold [Anderson & May, 1991], namely $c_c = 1 - 1/\mathcal{R}$,
where $c_c$ is the critical value of the vaccination level $c$ when writing the
ODE for $S$ in the form

$$
\dot S = \mu \left( (1 - c) N - S \right) - \beta \frac{I}{N} S
\tag{3.13}
$$

as opposed to

$$
\dot S = \mu (N - S) - \beta \frac{I}{N} S - v S .
\tag{3.14}
$$

Explicitly the argument goes as follows: At criticality $v_c = \mu(\mathcal{R} - 1)$,
and from the definition $\mathcal{R} = \beta/(\gamma + \mu)$, we obtain

$$
\frac{S^*_c}{N} = \frac{\mu}{\mu + v_c} = \frac{1}{\mathcal{R}}
\tag{3.15}
$$

hence

$$
v_c \cdot S^*_c = \mu (\mathcal{R} - 1) \cdot \frac{N}{\mathcal{R}}
 = \mu \left( 1 - \frac{1}{\mathcal{R}} \right) N .
\tag{3.16}
$$

From

$$
\dot S = \mu (N - S) - \beta \frac{I}{N} S - v S
\tag{3.17}
$$

we therefore have in stationarity

$$
\dot S = \mu (N - S) - \beta \frac{I}{N} S - v_c \cdot S^*_c .
\tag{3.18}
$$

With Equ. (3.16) we finally get the analogous form of Equ. (3.13)

$$
\dot S = \mu \left( \left( 1 - \left( 1 - \frac{1}{\mathcal{R}} \right) \right) N - S \right) - \beta \frac{I}{N} S
\tag{3.19}
$$

from which it follows directly that $c_c = 1 - 1/\mathcal{R}$.

19.3.5. Parameters for measles epidemics 

Rough estimates for measles parameters are an average life time
$\mu^{-1} = 75$ years and an average infection period $\gamma^{-1} = 0.02$ years,
from an estimate of around 1 week.

The mean age of infection $(\beta \cdot I^*/N)^{-1} = 5$ years, with $I^*/N$ in
endemic equilibrium without vaccination, gives

$$
\beta = (\gamma + \mu) \left( \frac{\mu^{-1}}{5\,\text{years}} + 1 \right)
 \approx \gamma \, \frac{\mu^{-1}}{5\,\text{years}} = 750\ \text{years}^{-1} .
\tag{3.20}
$$

The average age of vaccination can be $a^{-1} \approx 1$ year to 3 years. Since
it only varies the percentage of to-be-vaccinated susceptibles $\rho$, we do not
have to specify this parameter very accurately, taking $a^{-1} = 3$ years.
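The arithmetic behind these rough estimates can be reproduced in a few lines. The sketch below computes the contact rate both with and without the "+1" term, and from it the reproduction number $\mathcal{R} = \beta/(\gamma + \mu)$ and the critical uptake $c_c = 1 - 1/\mathcal{R}$. Function and variable names are assumptions made for this illustration.

```python
def measles_parameters(life_time=75.0, infectious_period=0.02, mean_age=5.0):
    """Return (mu, gamma, beta_full, beta_rough), all rates in 1/year."""
    mu = 1.0 / life_time
    gamma = 1.0 / infectious_period
    beta_full = (gamma + mu) * (life_time / mean_age + 1.0)  # keeps the "+1" term
    beta_rough = gamma * life_time / mean_age                # rough estimate
    return mu, gamma, beta_full, beta_rough


def reproduction_number(beta, gamma, mu):
    """Reproduction number R = beta / (gamma + mu)."""
    return beta / (gamma + mu)


def critical_uptake(reproduction):
    """Critical vaccination level c_c = 1 - 1/R."""
    return 1.0 - 1.0 / reproduction


mu_m, gamma_m, beta_full_m, beta_rough_m = measles_parameters()
r_rough = reproduction_number(beta_rough_m, gamma_m, mu_m)
c_c_rough = critical_uptake(r_rough)
```

With the rounded value of 750 per year the reproduction number is close to 15 and the critical uptake close to 93%; keeping the "+1" term gives a contact rate of about 800 per year and a reproduction number of 16, so the two variants agree at the level of accuracy intended by these rough estimates.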

19.3.6. Stochastic simulations 

Simulations are done in the framework of master equations to capture the
population noise, using Gillespie's algorithm [Gillespie, 1976]. The Gillespie
algorithm, often also called minimal process algorithm, is a Monte Carlo
method in which, after an event, i.e. a transition from state $n$ to another
state $\tilde n$, the exponential waiting time is calculated as a random
variable from the sum of all transition rates, after which the next transition
is chosen randomly from all now possible transitions, according to their
relative transition rates.

In analogy to the SIR-model described previously, using Equ. (2.2), the
rates $w_{\tilde n,n}$ for our model with vaccination are



$$
\begin{aligned}
w_{(S-1,I+1,R),(S,I,R)} &= \beta \cdot \frac{I}{N} \cdot S \\
w_{(S,I-1,R+1),(S,I,R)} &= \gamma \cdot I \\
w_{(S+1,I,R-1),(S,I,R)} &= \mu \cdot R \\
w_{(S+1,I-1,R),(S,I,R)} &= \mu \cdot I \\
w_{(S-1+1,I,R),(S,I,R)} &= \mu \cdot S \\
w_{(S-1,I,R+1),(S,I,R)} &= v \cdot S
\end{aligned}
\tag{3.21}
$$

from which the rates $w_{n,\tilde n}$ follow immediately as

$$
\begin{aligned}
w_{(S,I,R),(S+1,I-1,R)} &= \beta \cdot \frac{I-1}{N} \cdot (S+1) \\
w_{(S,I,R),(S,I+1,R-1)} &= \gamma \cdot (I+1) \\
w_{(S,I,R),(S-1,I,R+1)} &= \mu \cdot (R+1) \\
w_{(S,I,R),(S-1,I+1,R)} &= \mu \cdot (I+1) \\
w_{(S,I,R),(S+1-1,I,R)} &= \mu \cdot S \\
w_{(S,I,R),(S+1,I,R-1)} &= v \cdot (S+1) \, .
\end{aligned}
\tag{3.22}
$$
19.3.7. Bifurcation diagram for vaccine uptake 

We plot, for each value of the vaccine uptake $c$, the size of several
epidemics after 3 years, starting with one infected at the starting time. This
shows that for high uptake rates only small epidemics are found, but for low
values either the epidemic takes off with high epidemic levels or still dies
out quickly (bifurcation diagram). Large fluctuations are visible around the
deterministic threshold value for $c$.
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Figure 1. Bifurcation diagram for vaccine uptake c. 



At the equilibrium without infected (see above), we have

$$
S^*_1 = N \cdot \frac{\mu}{\mu + v}
\tag{3.23}
$$

or in terms of $c$ instead of $v$

$$
S^*_1 = N \cdot (1 - c)
\tag{3.24}
$$

hence

$$
v(c) = \frac{\mu c}{1 - c}
\tag{3.25}
$$

or

$$
c(v) = \frac{v}{\mu + v} .
\tag{3.26}
$$

In Fig. 1 we show stochastic simulations for various values of $c$,
recalculating $v(c)$ for the simulations and starting each in the stationary
values for $S$, $R$ and one infected, $I = 1$. The simulations are done for
3 years of epidemics. This summarizes the previous plots.

19.3.8. Epidemics when dropping the vaccine uptake 

We consider the size of epidemics when lowering the uptake from 96% to
80%, introducing one infected at time $t_i$.

Figure 2. a) $S(t)$ with $c_1 = 96\%$ and $c_2 = 80\%$ starting at
$S(t_0) := S^*_1(c_1) = N(1 - c_1)$, but with $c_2$ (respectively $v(c_2)$) all
the time. b) Size of epidemics when dropping the uptake from 96% to 80%,
introducing one infected at $t_i$.



From

$$
\dot S = \mu \left( (1 - c) N - S \right) - \beta \frac{I}{N} S
\tag{3.27}
$$

with $c := c_2$ and $I(t) = 0$, i.e. no infected in the system, we obtain with
$S(t_0) := S^*_1(c_1) = N(1 - c_1)$

$$
S(t) = N(1 - c_1) + N(c_1 - c_2) \cdot \left( 1 - e^{-\mu (t - t_0)} \right)
\tag{3.28}
$$

i.e. $S(t_\infty) = N(1 - c_2) = S^*_1(c_2)$ and $N(c_1 - c_2) = S(t_\infty) - S(t_0)$.

For Fig. 2 b) we take $S(t) = S(t_i)$ as starting condition for a stochastic
simulation of 1 year of epidemics, introducing exactly one infected at time
$t_i$ into the system. For the stochastic simulations and $S(t)$ we have to
consider the dynamics of the fast vaccination time scale with $v(c)$ instead of
$c$ itself. So we start with

$$
\dot S = \mu (N - S) - v S
\tag{3.29}
$$

giving

$$
S(t) = S(t_0) + \left( S(t_\infty) - S(t_0) \right) \cdot \left( 1 - e^{-(\mu + v_2)(t - t_0)} \right)
\tag{3.30}
$$

with $S(t_0) = N(1 - c_1) = N\mu/(\mu + v_1)$ and $S(t_\infty) = N(1 - c_2)
= N\mu/(\mu + v_2)$, equally if expressed in $v$ or $c$. This results in the
faster time scale for $S(t)$ with $(\mu + v_2)$ in the exponential instead of
the slow $\mu$ only.
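The two relaxation curves for the susceptibles after a drop in uptake can be compared directly. The sketch below converts between uptake $c$ and vaccination rate $v$ via $v = \mu c/(1 - c)$ and evaluates the exponential approach from $N(1 - c_1)$ to $N(1 - c_2)$ with either the slow rate $\mu$ or the fast rate $\mu + v_2$. Function names and the numbers in the usage lines are assumptions for illustration.

```python
import math


def rate_from_uptake(c, mu):
    """Vaccination rate v belonging to a stationary uptake level c."""
    return mu * c / (1.0 - c)


def uptake_from_rate(v, mu):
    """Stationary uptake level c belonging to a vaccination rate v."""
    return v / (mu + v)


def susceptibles_after_drop(t, n_total, c1, c2, mu, fast=True):
    """S(t) for t measured from the moment the uptake drops from c1 to c2."""
    s_start = n_total * (1.0 - c1)
    s_end = n_total * (1.0 - c2)
    decay = mu + rate_from_uptake(c2, mu) if fast else mu
    return s_start + (s_end - s_start) * (1.0 - math.exp(-decay * t))


# Illustrative comparison 5 years after a drop from 96% to 80% uptake.
mu_year = 1.0 / 75.0
s_fast = susceptibles_after_drop(5.0, 1000.0, 0.96, 0.80, mu_year, fast=True)
s_slow = susceptibles_after_drop(5.0, 1000.0, 0.96, 0.80, mu_year, fast=False)
```

In this example the fast time scale is $\mu + v_2 = \mu/(1 - c_2)$, i.e. a relaxation time of 15 years instead of 75 years, so the pool of susceptibles builds up noticeably sooner than the slow curve suggests.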

In summary, the effect of a decrease in vaccine uptake to low levels shows
only after some time, during which the number of susceptibles builds up and
large epidemics become more and more likely. Translated into the situation in
the UK, large outbreaks of measles are to be expected soon, since the
vaccination level, varying regionally, has dropped from around 96% to as low
as 85% and in some parts of London even below 80%.

19.4 Meningitis around criticality 

This section is based on previous work [Stollenwerk & Jansen, 2002], but
also includes later results. Though meningitis and septicaemia are only rarely
observed diseases, and often in linked smaller or larger epidemics, the bacteria
causing the disease can be detected in as many as 30 or 40% of the host
population as harmless commensals. Rarely, mutations in these bacteria occur,
and from time to time they make the severe mistake of harming their hosts
heavily, in former times almost always fatally.

We model the host dynamics for meningitis and septicaemia as a simple
SIR-model for the harmless strain of bacteria, with additional classes for the
infection with mutant bacteria, called $Y$ hosts, and heavily diseased cases $X$.
With this model we can show that huge fluctuations appear when the chance of
a mutant causing a diseased case, called pathogenicity, is small. Furthermore,
we can show that in systems with mutations of various values of pathogenicity
only those with small pathogenicity are present for significant periods of time.
For such small values of the pathogenicity we can furthermore show power law
behaviour of the size distribution of epidemics (see [Stollenwerk & Jansen,
2002] for details), hence demonstrate that the system is in criticality. The
aspect of evolution towards criticality is first described here.
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19.4.1. The meningitis model 

In order to describe the behaviour of pathogenic strains added to the basic
SIR-system we include a new class $Y$ of individuals infected with a potentially
pathogenic strain. We will assume that such strains arise, e.g. by point
mutations or recombination, through a mutation process with a rate $\mu$ in the
"reaction scheme" $S + I \to Y + I$. (For symmetry, we also allow the mutants
to backmutate with rate $\nu$, hence $S + Y \to I + Y$.)

The major point here in introducing the mutant is that the mutant has the
same basic epidemiological parameters $\alpha$, $\beta$ and $\gamma$ as the
original strain and only differs in its additional transition to pathogenicity
with rate $\varepsilon$.

These mutants cause disease with rate $\varepsilon$, which will turn out to be
small later on, hence the reaction scheme is $S + Y \to X + Y$. This sends
susceptible hosts into an $X$ class, which contains all hosts who develop the
symptomatic disease. These are the cases which are detectable, as opposed to
hosts in classes $Y$ and $I$, who are asymptomatic carriers and cannot be
detected easily.

The state vector in the extended model is now $n = (S,I,R,Y,X)$. The
mutation transition $S + I \to Y + I$ fixes the master equation transition rate
$w_{(S-1,I,R,Y+1,X),(S,I,R,Y,X)} = \mu \cdot (I/N) \cdot S$. In order to denote
the total contact rate still with the parameter $\beta$, we keep the balancing
relation

$$
w_{(S-1,I+1,R,Y,X),(S,I,R,Y,X)} + w_{(S-1,I,R,Y+1,X),(S,I,R,Y,X)} = \beta \cdot \frac{I}{N} \cdot S
\tag{4.1}
$$

and obtain for the ordinary infection of normal carriage the transition rate
$w_{(S-1,I+1,R,Y,X),(S,I,R,Y,X)} = (\beta - \mu) \cdot (I/N) \cdot S$.
Respectively, to denote the total rate of contacts a susceptible host can make
with any infected, either normal carriage $I$ or mutant carriage $Y$, by
$\beta$, we obey the balancing equation

$$
\sum_{\tilde m \neq m} w_{(S-1,\tilde m),(S,m)} = \beta \cdot \frac{I + Y}{N} \cdot S
\tag{4.2}
$$

for $m = (I,R,Y,X)$. With the above mentioned transitions this fixes the
master equation rate
$w_{(S-1,I,R,Y+1,X),(S,I,R,Y,X)} = (\beta - \nu - \varepsilon) \cdot (Y/N) \cdot S$.

For completeness, we introduce a recovery from severe meningitis or
septicaemia with rate $\varphi$, hence $X \to S$. For meningitis and
septicaemia the disease is fatal in many cases, hence $\varphi = 0$. With
medication the sufferers often survive, but are hospitalized for a long time
and then suffer from resulting impairments. So for the theoretical analysis we
will still keep $\varphi = 0$, which might be changed when analysing more
realistic situations or recent data.

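For the SIRYX-system the transition probabilities $w_{\tilde n,n}$ are then
given (omitting unchanged indices in $\tilde n$, with respect to $n$) by

$$
\begin{aligned}
w_{(R-1,S+1),(R,S)} &= \alpha \cdot R , & R &\to S \\
w_{(S-1,I+1),(S,I)} &= (\beta - \mu) \cdot \frac{I}{N} \cdot S , & S + I &\to I + I \\
w_{(I-1,R+1),(I,R)} &= \gamma \cdot I , & I &\to R \\
w_{(S-1,Y+1),(S,Y)} &= \mu \cdot \frac{I}{N} \cdot S , & S + I &\to Y + I \\
w_{(S-1,Y+1),(S,Y)} &= (\beta - \nu - \varepsilon) \cdot \frac{Y}{N} \cdot S , & S + Y &\to Y + Y \\
w_{(S-1,I+1),(S,I)} &= \nu \cdot \frac{Y}{N} \cdot S , & S + Y &\to I + Y \\
w_{(Y-1,R+1),(Y,R)} &= \gamma \cdot Y , & Y &\to R \\
w_{(S-1,X+1),(S,X)} &= \varepsilon \cdot \frac{Y}{N} \cdot S , & S + Y &\to X + Y \\
w_{(X-1,S+1),(X,S)} &= \varphi \cdot X , & X &\to S
\end{aligned}
\tag{4.3}
$$

along with the respective reaction schemes. Again from $w_{\tilde n,n}$ the
rates $w_{n,\tilde n}$ follow immediately. This defines the master equation for
the full SIRYX-system.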













19.4.2. The invasion dynamics of mutant strains 

Before we proceed with further theoretical analysis of the model we now
demonstrate basic properties of our SIRYX-model in simulations of the master
equation, using the Gillespie algorithm, also known as minimal process
algorithm [Gillespie, 1976]. This is a Monte Carlo method in which, after an
event, i.e. a transition from state $n$ to another state $\tilde n$, the
exponential waiting time is calculated as a random variable from the sum of all
transition rates, after which the next transition is chosen randomly from all
now possible transitions, according to their relative transition rates.
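For the five-class model it is convenient to drive the same algorithm from a table of reactions. The sketch below lists the nine transitions of the SIRYX-model as (rate, source class, target class) triples, following the reaction schemes described in the text, and runs one realization. The dictionary keys, the function names, the population size and the values chosen for $\mu$ and $\nu$ are assumptions for illustration; $\alpha$, $\beta$, $\gamma$ and $\varepsilon$ are set to values quoted in this section.

```python
import random

# Positions of the classes in the state vector n = (S, I, R, Y, X).
S, I, R, Y, X = range(5)


def siryx_reactions(n, p):
    """Return (rate, source, target) triples for state n and parameters p."""
    contact_i = n[I] / p["N"] * n[S]   # contacts of susceptibles with I
    contact_y = n[Y] / p["N"] * n[S]   # contacts of susceptibles with Y
    return [
        (p["alpha"] * n[R], R, S),                                 # R -> S
        ((p["beta"] - p["mu"]) * contact_i, S, I),                 # S + I -> I + I
        (p["gamma"] * n[I], I, R),                                 # I -> R
        (p["mu"] * contact_i, S, Y),                               # S + I -> Y + I
        ((p["beta"] - p["nu"] - p["eps"]) * contact_y, S, Y),      # S + Y -> Y + Y
        (p["nu"] * contact_y, S, I),                               # S + Y -> I + Y
        (p["gamma"] * n[Y], Y, R),                                 # Y -> R
        (p["eps"] * contact_y, S, X),                              # S + Y -> X + Y
        (p["phi"] * n[X], X, S),                                   # X -> S
    ]


def simulate_siryx(n, p, t_end, rng):
    """Run one Gillespie realization up to t_end and return the final state."""
    n = list(n)
    t = 0.0
    while True:
        reactions = siryx_reactions(n, p)
        total = sum(rate for rate, _, _ in reactions)
        if total <= 0.0:
            break
        t += rng.expovariate(total)
        if t > t_end:
            break
        threshold = rng.random() * total
        cumulative = 0.0
        for rate, source, target in reactions:
            cumulative += rate
            if rate > 0.0 and threshold < cumulative:
                n[source] -= 1   # one host leaves the source class ...
                n[target] += 1   # ... and enters the target class
                break
    return tuple(n)


# Illustrative parameters; N, mu and nu are assumed values.
params = {"N": 1000, "alpha": 0.1, "beta": 0.2, "gamma": 0.1,
          "mu": 1e-4, "nu": 1e-4, "eps": 0.05, "phi": 0.0}
start = (500, 250, 240, 10, 0)
end_state = simulate_siryx(start, params, 20.0, random.Random(3))
```

Summing the rates of all reactions that consume a susceptible reproduces the balancing relation $\beta \cdot (I + Y)/N \cdot S$, which is a useful sanity check whenever a transition is added to or removed from the table.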

To investigate the dynamics of the infection with mutants, class $Y$, in
relation to the normal carriage $I$ with harmless strains, we first fix the
basic SIR-subsystem's parameters to the values $\alpha := 0.1$, $\beta := 0.2$
and $\gamma := 0.1$.

The endemic equilibrium of the SIR-system is given by

$$
S^* = N \frac{\gamma}{\beta} , \qquad
I^* = N \cdot \frac{\alpha}{\beta} \left( \frac{\beta - \gamma}{\alpha + \gamma} \right) , \qquad
R^* = N - S^* - I^*
\tag{4.4}
$$

as can be seen from Equs. (2.1) by setting the left hand side of each
subequation to zero. This equilibrium would correspond to labelling 2, hence
$S^*_2$ etc., in previous chapters. With the parameters used, we find in
equilibrium a normal level of carriage of harmless infection of about 25% in
our total population of size $N$. This is in agreement with reported levels of
carriage for Neisseria meningitidis. Average duration of carriage is in the
order of 10 months, hence we choose $\gamma = 0.1$. We assume the duration of
immunity to be the same as the duration of carriage. In equilibrium this
results in the ratio $S^* : I^* : R^* = 2 : 1 : 1$. However, the qualitative
results are not affected by these first guesses of parameter values, but rather
by their order of magnitude.

Interesting behaviour is observed when the pathogenicity $\varepsilon$ is too
large for the hyperinvasive strain to take over but small enough to create
large outbreaks of mutant infecteds $Y$ before becoming extinct again. In
Fig. 3 we show two simulations in this $\varepsilon$-region, first
$\varepsilon = 0.05$, Fig. 3 a), b), then a ten times smaller $\varepsilon$,
Fig. 3 c), d). For high pathogenicity $\varepsilon$ we find relatively low
levels of mutants $Y$, in Fig. 3 a) less than 20 cases, and at the end of the
simulation roughly between 15 and 80 hospital cases $X$, Fig. 3 b). For smaller
pathogenicity $\varepsilon$, Fig. 3 c), we find much larger fluctuations in the
number of mutants $Y$, with peaks of more than 80 mutant infected hosts. Though
the probability rate to cause disease $\varepsilon$ is ten times smaller than
in the previous simulation, we find at the end of this simulation similar
numbers of disease cases $X$, Fig. 3 d). We observed larger fluctuations and
sometimes many more outbreaks of diseased cases, though the probability to
create disease is smaller.

This counter-intuitive result can be understood by considering the dynamics 
of the hyperinvasive lineage in detail. We will do so by analyzing a simplified 
version of our SIRYX-model analytically. 

19.4.3. Divergent fluctuations for vanishing pathogenicity 

For pathogenicity $\varepsilon$ larger than the mutation rate $\mu$ the
hyperinvasive lineage normally does not attain very high densities compared to
the total population size. Therefore, we can consider the full system as
composed of a dominating SIR-system which is not really affected by the rare
$Y$ and $X$ cases, calling it the SIR-heat bath, and our system of interest,
namely the $Y$ cases and their resulting pathogenic cases $X$, considered to
live in the SIR-heat bath.

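Taking into account Equs. (4.4) for the stationary values of the SIR-system
we obtain for the transition rates (compare Equs. (4.3)) of the remaining
YX-system

$$
\begin{aligned}
w_{(S^*,Y+1),(S^*,Y)} &= \mu \cdot \frac{S^*}{N} \cdot I^* &&= c \\
w_{(S^*,Y+1),(S^*,Y)} &= (\beta - \nu - \varepsilon) \cdot \frac{S^*}{N} \cdot Y &&= b \cdot Y \\
w_{(S^*,X+1),(S^*,X)} &= \varepsilon \cdot \frac{S^*}{N} \cdot Y &&= g \cdot Y \\
w_{(Y-1,R^*),(Y,R^*)} &= \gamma \cdot Y &&= a \cdot Y \\
w_{(X-1,S^*),(X,S^*)} &= \varphi \cdot X \, .
\end{aligned}
\tag{4.5}
$$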

All terms not involving $Y$ or $X$ vanish from the master equation, since the
gain and loss terms cancel each other out for such transitions. If we neglect
the recovery of the disease cases to susceptibility, $\varphi = 0$, as is
reasonable for meningitis, we are only left with $Y$-dependent transition
rates. Hence for the YX-system we obtain the master equation

$$
\begin{aligned}
\frac{d}{dt} p(Y,X,t) ={} & (b \cdot (Y-1) + c)\, p(Y-1,X,t) + a \cdot (Y+1)\, p(Y+1,X,t) \\
& + g \cdot Y\, p(Y,X-1,t) - (bY + aY + gY + c)\, p(Y,X,t) .
\end{aligned}
\tag{4.6}
$$

This gives for the marginal distribution $p(Y,t) := \sum_{X=0}^{\infty} p(Y,X,t)$
the master equation for a simple birth-death process with birth rate
$b := (\beta - \nu - \varepsilon) \cdot \frac{S^*}{N}$, death rate
$a := \gamma$ and a migration rate $c := \mu \cdot \frac{S^*}{N} I^*$. In the
definition of the marginal distribution we take the upper limit of the
summation to infinity, since we assume numbers of $X$ and $Y$ cases to be well
below the stationary values of the SIR-system, i.e. they will not be affected
by any finite upper boundary. We will check the validity of this assumption
later with simulations of the full SIRYX-system.

Figure 3. a) Time series of ten runs showing the mutant carriage $Y$ for
pathogenicity $\varepsilon = 0.05$. b) Number of seriously diseased cases $X$
for pathogenicity $\varepsilon = 0.05$. c) and d) as a) and b) with
pathogenicity ten times smaller, hence $\varepsilon = 0.005$. Although the
pathogenicity $\varepsilon$ is smaller by a factor of ten, the damage in the
number of seriously diseased cases $X$ remains high and even varies more than
for larger $\varepsilon$.

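Hence we have for $Y \in \mathbb{N}$

$$
\frac{d}{dt} p(Y,t) = (b \cdot (Y-1) + c)\, p(Y-1,t) + a \cdot (Y+1)\, p(Y+1,t)
 - (bY + aY + c)\, p(Y,t)
\tag{4.7}
$$

and for $Y = 0$ as boundary equation

$$
\frac{d}{dt} p(Y=0,t) = a \cdot p(Y=1,t) - c \cdot p(Y=0,t) .
\tag{4.8}
$$

For the ensemble mean $\langle Y \rangle := \sum_{Y=0}^{\infty} Y \cdot p(Y,t)$
we obtain, using the above master equation,

$$
\frac{d}{dt} \langle Y \rangle = (b - a) \cdot \langle Y \rangle + c .
\tag{4.9}
$$

And for the variance, $Var(t) := \langle Y^2 \rangle - \langle Y \rangle^2$, we
obtain

$$
\frac{d}{dt} Var(t) = 2 (b - a)\, Var(t) + (b + a) \cdot \langle Y \rangle + c .
\tag{4.10}
$$

We neglect the mutation and backmutation terms, setting $c = 0$, and $\nu = 0$
in the definition for $b$. In this case

$$
b - a = (\beta - \varepsilon) \cdot \frac{S^*}{N} - \gamma = -\varepsilon \cdot \frac{S^*}{N}
\tag{4.11}
$$

is proportional to $\varepsilon$. We set $g := \varepsilon \cdot \frac{S^*}{N}$,
and the ODEs for mean $\bar Y(t) := \langle Y \rangle$ and variance $Var(t)$
then read

$$
\begin{aligned}
\frac{d}{dt} \bar Y(t) &= -g \cdot \bar Y(t) \\
\frac{d}{dt} Var(t) &= -2g \cdot Var(t) + (2\gamma - g)\, \bar Y(t)
\end{aligned}
\tag{4.12}
$$

with initial conditions $\bar Y(t = t_0) = 1$, $Var(t = t_0) = 0$. The
solutions are

$$
\begin{aligned}
\bar Y(t) &= e^{-g (t - t_0)} , \\
Var(t) &= \left( \frac{2\gamma - g}{g} \right) e^{-g (t - t_0)} \left( 1 - e^{-g (t - t_0)} \right) .
\end{aligned}
\tag{4.13}
$$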
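The closed forms for mean and variance can be cross-checked against a direct numerical integration of the two moment ODEs (with $c = 0$ and $\nu = 0$). The sketch below uses a simple Euler scheme, which is an assumption made for brevity, and also evaluates the maximum of the variance, $(2\gamma - g)/(4g)$, reached when $e^{-g(t - t_0)} = 1/2$. Function names and sample values are illustrative.

```python
import math


def closed_form_moments(t, gamma, g):
    """Mean and variance of Y at time t after t0, started from one mutant."""
    decay = math.exp(-g * t)
    return decay, (2.0 * gamma - g) / g * decay * (1.0 - decay)


def integrate_moments(t_end, dt, gamma, g):
    """Euler integration of the moment ODEs from mean = 1, variance = 0."""
    mean, var = 1.0, 0.0
    for _ in range(int(round(t_end / dt))):
        d_mean = -g * mean
        d_var = -2.0 * g * var + (2.0 * gamma - g) * mean
        mean += dt * d_mean
        var += dt * d_var
    return mean, var


def peak_variance(gamma, g):
    """Largest value the variance reaches over time."""
    return (2.0 * gamma - g) / (4.0 * g)


# Illustrative values: gamma = 0.1, beta = 0.2, epsilon = 0.05, so g = 0.025.
g_example = 0.05 * 0.1 / 0.2
numeric = integrate_moments(10.0, 1e-3, 0.1, g_example)
exact = closed_form_moments(10.0, 0.1, g_example)
```

Since the peak variance grows like $\gamma/(2g)$ for small $g$, it diverges as the pathogenicity goes to zero, which is the behaviour named in the title of this subsection.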
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19.4.4. Evolution towards criticality 

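We show now that in a population with uniformly distributed pathogenicity
$\varepsilon$, after some time only the hosts with mutants of low pathogenicity
remain in the system. We assume initially one infected with a mutant of
pathogenicity $\varepsilon$ for all possible pathogenicities, and then consider
the relative frequency of infected with a certain pathogenicity,

$$
p(\varepsilon,t) := \frac{\langle Y \rangle(\varepsilon,t)}{\int_0^{\varepsilon_m} \langle Y \rangle(\varepsilon',t)\, d\varepsilon'}
\tag{4.14}
$$

with

$$
\langle Y \rangle(\varepsilon,t) = e^{-g t}
\tag{4.15}
$$

and $g := \varepsilon \cdot \gamma/\beta$. This is derived from the ODE
$\frac{d}{dt}\langle Y \rangle = (b - a)\langle Y \rangle$ with $b - a = -g$.
The result is

$$
p(\varepsilon,t) = \frac{\frac{\gamma}{\beta} t \cdot e^{-\varepsilon \frac{\gamma}{\beta} t}}{1 - e^{-\varepsilon_m \frac{\gamma}{\beta} t}}
\tag{4.16}
$$

with initial distribution $p(\varepsilon,t_0) = 1/\varepsilon_m$ for
$\varepsilon \in [0, \varepsilon_m]$. For time going towards infinity
$p(\varepsilon, t \to \infty) = \delta(\varepsilon)$, hence all mass at
$\varepsilon = 0$.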
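The shift of probability mass towards small pathogenicity can be evaluated numerically by normalising the mean $\langle Y \rangle(\varepsilon,t) = e^{-\varepsilon (\gamma/\beta) t}$ over the interval $[0, \varepsilon_m]$. The sketch below assumes exactly this normalisation, with time measured from the initial time; the function names and the value $\varepsilon_m = 0.05$ are assumptions for illustration.

```python
import math


def pathogenicity_density(eps, t, eps_max, gamma, beta):
    """Normalised density of pathogenicity eps among mutant carriers at time t."""
    k = gamma / beta * t
    if k == 0.0:
        return 1.0 / eps_max                  # uniform initial distribution
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small x.
    return k * math.exp(-k * eps) / (-math.expm1(-k * eps_max))


def total_mass(t, eps_max, gamma, beta, n_cells=2000):
    """Midpoint-rule integral of the density over [0, eps_max]."""
    h = eps_max / n_cells
    return sum(pathogenicity_density((cell + 0.5) * h, t, eps_max, gamma, beta) * h
               for cell in range(n_cells))


# Illustrative evaluation at a late time for gamma = 0.1, beta = 0.2.
density_low = pathogenicity_density(0.001, 100.0, 0.05, 0.1, 0.2)
density_high = pathogenicity_density(0.04, 100.0, 0.05, 0.1, 0.2)
```

At the initial time the density is flat at $1/\varepsilon_m$, and for later times it tilts more and more towards $\varepsilon = 0$ while its integral stays equal to one.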

19.4.5. Simulation for the full SIRYX model 

In simulations of the full SIRYX-system we consider a variety of
pathogenicities $\varepsilon_i$ and for each of those we perform a large number
of runs $j$, recording the number of mutant infected $Y_j(\varepsilon_i,t)$
over time. Hence the distribution of pathogenicities in an ensemble of hosts
infected with different mutant strains is given by

p(£i,t):= J (4.17) 

with $\Delta\varepsilon$ the length of the considered $\varepsilon$-interval
times the number of $\varepsilon$-values. We compare the simulation results
with the previous theoretical results in Fig. 5.

19.4.6. Power law at criticality 

We have shown previously [Stollenwerk & Jansen, 2002] that the size of 
the epidemics, once the epidemics have died out, follows a power law as ob- 
served in branching processes. These power laws are a characteristic sign for 
criticality. 

In a simplified model, where the SIR-subsystem is assumed to be stationary
(due to its fast dynamics), we can show analytically divergence of variance
and power law behaviour for the size of the epidemics $p(X)$ as soon as the
pathogenicity is going towards zero. Hence the counter-intuitively large
number of disease cases in some realizations of the process can be understood
as large scale fluctuations in a critical system with order parameter
$\varepsilon$ towards zero. The master equation for YX in stationary SIR
results in a birth-death process

$$
\begin{aligned}
\frac{d}{dt} p(Y,X,t) ={} & (b \cdot (Y-1) + c)\, p(Y-1,X,t) \\
& + a \cdot (Y+1)\, p(Y+1,X,t) + g \cdot Y\, p(Y,X-1,t) \\
& - (bY + aY + gY + c)\, p(Y,X,t) .
\end{aligned}
\tag{4.18}
$$

Figure 4. Times t = 1, horizontal line, t = 20, slightly tilted line, and
t = 100, where all the probability is going towards small pathogenicity values.

Considering $\varepsilon \to 0$ and large $X$, we obtain power law behaviour
for the size distribution of the epidemic



p e (X) := lim p(Y = 0, X, t) - — ^ • e\ ■ X'i 



(4.19) 



This was obtained by approximations to a solution with the hypergeometric 
function 



r 2-(* +1 ) ^ /3-X 2-X n , e 



(4.20) 
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Figure 5. Comparison of simulations of the complete SIRYX-system with the theoretical 
curve from the YX-subsystem and assumption of SIR in stationarity. Here time t = 20 is 
shown. 



Such behaviour near criticality is also observed in the full SIRYX-system in
simulations where the pathogenicity $\varepsilon$ is small, i.e. in the range
of the mutation rate $\mu$.

In spatial versions of this model it is expected that the critical exponents
are those of directed percolation (private communication, H.K. Janssen,
Duesseldorf, see also [Janssen, 1981]). We will discuss directed percolation
and its relation to birth-death processes in a subsequent section.

19.5 Spatial stochastic epidemics 

Non-spatial stochastic processes, as described e.g. in [van Kampen, 1992] 
for chemical and physical processes, have been applied to biology for a long 
time [Goel & Richter-Dyn, 1974], whereas spatial aspects have more recently 
enjoyed considerable attention among biologists, especially ecologists and epi- 
demiologists (e.g. [Keeling et al, 1997], for an overview of the development 
during the 1990s see [Rand, 1999], and recently [Dieckmann et al, 2000]). 

As a starting point we use the master equation approach for a spatial system,
as for example used in [Glauber, 1963], and derive from it equations for the
dynamics of moments, which under additional assumptions give closed
ODE-systems (moment closure methods). Such ODE-systems have very recently
been used to manage real world epidemics [Ferguson et al, 2001]. In the
simplest moment closure, the mean field assumption, we recover the usual ODEs
which classically were used as starting points for deterministic models.
We will show this explicitly for the simplest SIS-model. The approach can be
applied easily to more complicated models with some more writing effort.

The spatial master equation as used here will also be applied to investigate
the fluctuations around critical points, a situation in which the simple
moment closure assumptions do not hold any more. For detailed analysis see
([Cardy & Tauber, 1998], [Brunel et al, 2000]) and related [Grassberger, 1983],
[Grassberger & Scheunert, 1980], [Peliti, 1985]. The basic procedure will be
described in the following section.

19.5.1. Spatial master equation 

One of the simplest and best studied spatial processes is the birth-death
process with birth rate $b$ and death rate $a$ on $N$ sites, of which each can
be either inhabited, $I_i := 1$, or empty, $S_i := 1$, hence $I_i = 0$ (in
general $S_i := 1 - I_i$).

Translated into epidemiology, $I$ is the infected, $S$ the susceptible class,
$b$ the infection rate, $a$ the recovery rate. We refer to it as SIS-system.
(In this section we use letters $a$ and $b$ etc. as is conventional for spatial
birth-death processes with no reference to notations used in previous
sections.) The master equation for the spatial SIS-system is for $N$ lattice
points

$$
\frac{d}{dt} p(I_1,\dots,I_N,t) = \sum_{i=1}^{N} w_{I_i,1-I_i}\, p(I_1,\dots,1-I_i,\dots,I_N,t)
 - \sum_{i=1}^{N} w_{1-I_i,I_i}\, p(I_1,\dots,I_i,\dots,I_N,t)
\tag{5.1}
$$

where $I_i \in \{0, 1\}$ and transition rates

$$
w_{I_i,1-I_i} = b \left( \sum_{j=1}^{N} J_{ij} I_j \right) \cdot I_i + a \cdot (1 - I_i)
\tag{5.2}
$$

and

$$
w_{1-I_i,I_i} = b \left( \sum_{j=1}^{N} J_{ij} I_j \right) \cdot (1 - I_i) + a \cdot I_i ,
\tag{5.3}
$$

with $b$ birth or infection rate and $a$ death or recovery rate. Here
$(J_{ij})$ is the adjacency matrix containing 0 for no connection and 1 for a
connection between sites $i$ and $j$, hence $J_{ij} = J_{ji} \in \{0, 1\}$ for
$i \neq j$ and $J_{ii} = 0$.
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Define the number of clusters with certain shapes, for total number 

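$$
[I] := \sum_{i=1}^{N} I_i
\tag{5.4}
$$

and respectively

$$
[S] := \sum_{i=1}^{N} (1 - I_i)
\tag{5.5}
$$

and for pairs

$$
[II] := \sum_{i=1}^{N} \sum_{j=1}^{N} J_{ij} \cdot I_i I_j
\tag{5.6}
$$

and triples

$$
[III] := \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1}^{N} J_{ij} J_{jk} \cdot I_i I_j I_k
\tag{5.7}
$$

or triangles

$$
[\triangle] := \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1}^{N} J_{ij} J_{jk} J_{ki} \cdot I_i I_j I_k
\tag{5.8}
$$

and so on.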
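These cluster counts are straightforward to evaluate for a given configuration. The sketch below assumes a ring lattice as an example of an adjacency matrix and computes the totals of infected, susceptibles, pairs, triples and triangles as the sums defined above; the function names and the example configuration are assumptions for illustration. Note that the sums run over ordered index tuples, so each unordered pair of neighbouring infected sites is counted twice.

```python
def ring_adjacency(n_sites):
    """Adjacency matrix of a ring: each site is connected to its two neighbours."""
    adj = [[0] * n_sites for _ in range(n_sites)]
    for i in range(n_sites):
        adj[i][(i + 1) % n_sites] = 1
        adj[i][(i - 1) % n_sites] = 1
    return adj


def cluster_counts(adj, state):
    """Return ([I], [S], [II], [III], triangles) for a 0/1 configuration."""
    sites = range(len(state))
    infected = sum(state)
    susceptible = sum(1 - x for x in state)
    pairs = sum(adj[i][j] * state[i] * state[j]
                for i in sites for j in sites)
    triples = sum(adj[i][j] * adj[j][k] * state[i] * state[j] * state[k]
                  for i in sites for j in sites for k in sites)
    triangles = sum(adj[i][j] * adj[j][k] * adj[k][i]
                    * state[i] * state[j] * state[k]
                    for i in sites for j in sites for k in sites)
    return infected, susceptible, pairs, triples, triangles


# Three neighbouring infected sites on a ring of six sites.
counts = cluster_counts(ring_adjacency(6), [1, 1, 1, 0, 0, 0])
```

The triple sum includes index tuples with $i = k$, which is why the count of triples in this example is larger than the number of distinct chains of three sites.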

These space averages, e.g. $[I] := \sum_{i=1}^{N} I_i$, depend on the ensemble
$(I_1, \dots, I_N)$ which changes with time. Hence we define the ensemble
average, e.g.

$$
\langle I \rangle(t) := \sum_{I_1=0}^{1} \cdots \sum_{I_N=0}^{1} [I]\; p(I_1,\dots,I_N,t)
$$

or more generally for any function $f = f(I_1, \dots, I_N)$ of the state
variables we define the ensemble average as

$$
\langle f \rangle(t) := \sum_{I_1=0}^{1} \cdots \sum_{I_N=0}^{1} f(I_1,\dots,I_N)\; p(I_1,\dots,I_N,t) .
\tag{5.9}
$$

We will consider mainly functions like $f = [I]$, $f = [II]$ etc. Then the time
evolution is determined by

$$
\frac{d}{dt} \langle f \rangle(t) = \sum_{I_1=0}^{1} \cdots \sum_{I_N=0}^{1} f(I_1,\dots,I_N)\; \frac{d}{dt} p(I_1,\dots,I_N,t)
\tag{5.10}
$$

where the master equation is to be inserted again giving terms of the form
$\langle f \rangle$ and other expressions $\langle g(I_1, \dots, I_N) \rangle$.




By defining marginal distributions

$$
p(I_i,t) := \sum_{I_1=0}^{1} \cdots \cancel{\sum_{I_i=0}^{1}} \cdots \sum_{I_N=0}^{1} p(I_1,\dots,I_N,t)
\tag{5.11}
$$

and respectively

$$
p(I_i,I_j,t) := \sum_{I_1=0}^{1} \cdots \cancel{\sum_{I_i=0}^{1}} \cdots \cancel{\sum_{I_j=0}^{1}} \cdots \sum_{I_N=0}^{1} p(I_1,\dots,I_N,t)
\tag{5.12}
$$

one obtains for its realizations useful expressions like

$$
\begin{aligned}
p(I_i = 1, I_j = 0, t) :={} & \sum_{I_1=0}^{1} \cdots \cancel{\sum_{I_i=0}^{1}} \cdots \cancel{\sum_{I_j=0}^{1}} \cdots \sum_{I_N=0}^{1} p(I_1,\dots,I_i = 1,\dots,I_j = 0,\dots,I_N,t) \\
={} & \sum_{I_1=0}^{1} \cdots \sum_{I_N=0}^{1} I_i (1 - I_j)\; p(I_1,\dots,I_N,t)
\end{aligned}
\tag{5.13}
$$

which we will consider extensively in the subsequent text. The crossed out
summation signs in
$\sum_{I_1=0}^{1} \cdots \cancel{\sum_{I_i=0}^{1}} \cdots \sum_{I_N=0}^{1}$
indicate summation with respect to all sites $I_1$ to $I_N$, only excluding
summation over $I_i$.
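Hence it follows

$$
\langle II \rangle(t) = \sum_{i=1}^{N} \sum_{j=1}^{N} J_{ij} \langle I_i I_j \rangle
\tag{5.14}
$$

and with $\langle S_i I_j \rangle = \langle (1 - I_i) I_j \rangle$

$$
\langle SI \rangle(t) = \sum_{i=1}^{N} \sum_{j=1}^{N} J_{ij} \langle S_i I_j \rangle
 = \sum_{i=1}^{N} \langle I_i \rangle \left( \sum_{j=1}^{N} J_{ij} \right)
 - \sum_{i=1}^{N} \sum_{j=1}^{N} J_{ij} \langle I_i I_j \rangle
\tag{5.15}
$$

with

$$
\sum_{i=1}^{N} \langle I_i \rangle \left( \sum_{j=1}^{N} J_{ij} \right)
 = Q \cdot \sum_{i=1}^{N} \langle I_i \rangle = Q \cdot \langle I \rangle
\tag{5.16}
$$

for $Q_i := \left( \sum_{j=1}^{N} J_{ij} \right) = Q$ the number of neighbours
to site $i$, here assumed to be constant $Q$.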
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In more general, terms of the form 

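$$
\langle II \rangle_\nu := \sum_{i=1}^{N} \sum_{j=1}^{N} (J^\nu)_{ij} \langle I_i I_j \rangle
\tag{5.17}
$$

will appear with any $\nu$-th power of the adjacency matrix, e.g.
$(J^2)_{ij} = \sum_{k=1}^{N} J_{ik} J_{kj}$, and respectively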

$$
\langle III \rangle_{\nu,1} := \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1}^{N} (J^\nu)_{ij} J_{jk} \cdot \langle I_i I_j I_k \rangle
\tag{5.18}
$$

and so on. 

19.5.2. Time evolution of marginals and local expectations 

For the marginals we can put forward some rules which are rigorous but can
also be obtained intuitively from the master equation.

The birth-death process (or equivalently the SIS-epidemics, and a more
general class of processes specified below) presents the following expressions
for the dynamics of local quantities (like $\langle I_i \rangle$ etc.)

$$
\begin{aligned}
\frac{d}{dt} \langle I_i \rangle ={} & \sum_{I_1=0}^{1} \cdots \sum_{I_N=0}^{1} I_i\, \frac{d}{dt} p(I_1,\dots,I_N,t) \\
={} & \sum_{\{I\}} I_i \left( \sum_{k=1}^{N} w_{I_k,1-I_k}\, p(I_1,\dots,1-I_k,\dots,I_N,t)
 - \sum_{k=1}^{N} w_{1-I_k,I_k}\, p(I_1,\dots,I_k,\dots,I_N,t) \right)
\end{aligned}
\tag{5.19}
$$

using the definition $\sum_{\{I\}} := \sum_{I_1=0}^{1} \cdots \sum_{I_N=0}^{1}$
for the ensemble average and by inserting the master equation for the time
derivative of the probability.
For any function $f(I_i, I_j)$ we have

$$
\sum_{\{I\}} f(I_i, I_j)\, p(I_1,\dots,1-I_i,\dots,I_N,t)
 = \sum_{\{I\}} f(1-I_i, I_j)\, p(I_1,\dots,I_i,\dots,I_N,t) .
\tag{5.20}
$$
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This is obtained from the elementary consideration 
l 

£/(Wi-/i,t) 

Ji=0 

= /(o)p(i,t) + /(i)p(o,0 

= /(l)p(0,t) + /(0)p(l,t) 
1 

= £/(l-J i )p(/i,i) 

/i=0 

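and results in

$$\begin{aligned}
\frac{d}{dt} \langle I_i \rangle &= \sum_{\{I\}} I_i \left( \sum_{k=1, k \neq i}^{N} w_{1-I_k, I_k} \, p(I_1, \dots, I_k, \dots, I_N, t) - \sum_{k=1, k \neq i}^{N} w_{1-I_k, I_k} \, p(I_1, \dots, I_k, \dots, I_N, t) \right) \\
&\quad + \sum_{\{I\}} \Big( (1 - I_i) \, w_{1-I_i, I_i} \, p(I_1, \dots, I_i, \dots, I_N, t) - I_i \, w_{1-I_i, I_i} \, p(I_1, \dots, I_i, \dots, I_N, t) \Big) \\
&= \sum_{\{I\}} w_{1-I_i, I_i} \, \big( (1 - I_i) - I_i \big) \, p(I_1, \dots, I_i, \dots, I_N, t)
\end{aligned} \qquad (5.21)$$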

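For the variable $I_i \in \{0, 1\}$ we obtain the equations $I_i^2 = I_i$ and $(1 - I_i)^2 =
(1 - I_i)$ and hence $I_i (1 - I_i) = 0$, so that for the birth-death process

$$\begin{aligned}
w_{1-I_i, I_i} \cdot \big( (1 - I_i) - I_i \big) &= b \sum_{j=1}^{N} J_{ij} I_j \cdot \underbrace{\big( (1 - I_i)^2 - (1 - I_i) I_i \big)}_{= (1 - I_i)} + a \big( I_i (1 - I_i) - I_i^2 \big) \\
&= b \sum_{j=1}^{N} J_{ij} I_j (1 - I_i) - a I_i =: \tilde{w}_{1-I_i, I_i}
\end{aligned} \qquad (5.22)$$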

with a function $\tilde{w}$ with additive birth and subtractive death term.
The equation

$$w_{1-I_i, I_i} \cdot \big( (1 - I_i) - I_i \big) = \tilde{w}_{1-I_i, I_i} \qquad (5.23)$$

holds for general transition probabilities of the functional form

$$w_{1-I_i, I_i} = f(\{I_j\}_{j \neq i}) \cdot (1 - I_i) + g(\{I_j\}_{j \neq i}) \cdot I_i \qquad (5.24)$$

with arbitrary functions $f$ for birth terms and $g$ for death terms and $\tilde{w}$ defined
as

$$\tilde{w}_{1-I_i, I_i} := f(\{I_j\}_{j \neq i}) \cdot (1 - I_i) - g(\{I_j\}_{j \neq i}) \cdot I_i . \qquad (5.25)$$

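Hence we obtain

$$\begin{aligned}
\frac{d}{dt} \langle I_i \rangle &= \sum_{\{I\}} \tilde{w}_{1-I_i, I_i} \, p(I_1, \dots, I_i, \dots, I_N, t) \\
&= b \sum_{j=1}^{N} J_{ij} \langle I_j (1 - I_i) \rangle - a \langle I_i \rangle \\
&= b \sum_{j=1}^{N} J_{ij} \langle S_i I_j \rangle - a \langle I_i \rangle
\end{aligned} \qquad (5.26)$$

where in the last line we used again $S_i := 1 - I_i$. This provides an easy and
intuitive way to calculate generally such dynamics of local expectation values.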
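
The rule for local expectation values can be checked by brute force on a very small system. The sketch below is only an illustration under assumed values: an open chain of N = 3 sites, arbitrary rates `a` and `b`, and an arbitrary normalised probability vector `p`. It evaluates the time derivative of each local mean once directly from the master equation and once from the closed expression in terms of pair expectations.

```python
from itertools import product

a, b = 0.7, 1.3                          # death and birth rate (arbitrary test values)
J = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]    # open chain of N = 3 sites (assumed)
N = len(J)
states = list(product((0, 1), repeat=N))

# Arbitrary normalised probability over the 2^N configurations.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
p = {s: w / sum(weights) for s, w in zip(states, weights)}

def flip(s, i):
    """Configuration s with site i changed from I_i to 1 - I_i."""
    return s[:i] + (1 - s[i],) + s[i + 1:]

def rate_out(s, i):
    """Rate of flipping site i in configuration s: birth if empty, death if inhabited."""
    infected_neighbours = sum(J[i][j] * s[j] for j in range(N))
    return b * infected_neighbours * (1 - s[i]) + a * s[i]

def dp_dt(s):
    """Right-hand side of the master equation for configuration s."""
    gain = sum(rate_out(flip(s, i), i) * p[flip(s, i)] for i in range(N))
    loss = sum(rate_out(s, i) * p[s] for i in range(N))
    return gain - loss

def mean(f):
    return sum(f(s) * p[s] for s in states)

# Direct evaluation: sum over configurations of I_i times dp/dt.
lhs = [sum(s[i] * dp_dt(s) for s in states) for i in range(N)]
# Closed form: b * sum_j J_ij <S_i I_j> - a <I_i>.
rhs = [b * sum(J[i][j] * mean(lambda s: (1 - s[i]) * s[j]) for j in range(N))
       - a * mean(lambda s: s[i]) for i in range(N)]

print(lhs)
print(rhs)
```

Both lists agree up to rounding, and the derivatives of all configuration probabilities sum to zero, as required for conservation of probability.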

19.5.3. Moment equations 

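For the total number $\langle I \rangle := \sum_{i=1}^{N} \langle I_i \rangle$ we obtain the dynamics

$$\begin{aligned}
\frac{d}{dt} \langle I \rangle &= \sum_{i=1}^{N} \frac{d}{dt} \langle I_i \rangle \\
&= \sum_{i=1}^{N} \left( -a \langle I_i \rangle + b \sum_{j=1}^{N} J_{ij} \big( \langle I_j \rangle - \langle I_i I_j \rangle \big) \right) \\
&= -a \underbrace{\sum_{i=1}^{N} \langle I_i \rangle}_{= \langle I \rangle} + b \sum_{j=1}^{N} \langle I_j \rangle \underbrace{\sum_{i=1}^{N} J_{ij}}_{= Q} - b \underbrace{\sum_{i=1}^{N} \sum_{j=1}^{N} J_{ij} \langle I_i I_j \rangle}_{= \langle II \rangle_1} \\
&= -a \langle I \rangle + b Q \langle I \rangle - b \langle II \rangle_1 .
\end{aligned} \qquad (5.27)$$

Hence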



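$$\frac{d}{dt} \langle I \rangle = b \big( Q \langle I \rangle - \langle II \rangle_1 \big) - a \langle I \rangle = b \langle SI \rangle_1 - a \langle I \rangle \qquad (5.28)$$

with $\langle SI \rangle_1 := \sum_{i=1}^{N} \sum_{j=1}^{N} J_{ij} \langle S_i I_j \rangle = Q \langle I \rangle - \langle II \rangle_1$. To obtain the dynamics
for the total number of pairs

$$\frac{d}{dt} \langle II \rangle_1 = \sum_{i=1}^{N} \sum_{j=1}^{N} J_{ij} \, \frac{d}{dt} \langle I_i I_j \rangle \qquad (5.29)$$

we have to calculate first $\frac{d}{dt} \langle I_i I_j \rangle$ from the rules given above and using the
master equation. We thereby have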



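$$\begin{aligned}
\frac{d}{dt} \langle I_i I_j \rangle &= \sum_{\{I\}} I_i I_j \, \frac{d}{dt} p(I_1, \dots, I_N, t) \\
&= \sum_{\{I\}} \underbrace{\Big( I_j \big( (1 - I_i) - I_i \big) w_{1-I_i, I_i} + I_i \big( (1 - I_j) - I_j \big) w_{1-I_j, I_j} \Big)}_{\text{see equation (5.21)}} \cdot p(I_1, \dots, I_N, t) \\
&= \underbrace{\big\langle I_j \tilde{w}_{1-I_i, I_i} + I_i \tilde{w}_{1-I_j, I_j} \big\rangle}_{\text{see equation (5.22)}} \\
&= \Big\langle I_j \Big( b \sum_{k=1}^{N} J_{ik} I_k (1 - I_i) - a I_i \Big) \Big\rangle + \Big\langle I_i \Big( b \sum_{k=1}^{N} J_{jk} I_k (1 - I_j) - a I_j \Big) \Big\rangle \\
&= b \sum_{k=1}^{N} J_{ik} \langle I_j I_k (1 - I_i) \rangle - a \langle I_j I_i \rangle + b \sum_{k=1}^{N} J_{jk} \langle I_i I_k (1 - I_j) \rangle - a \langle I_i I_j \rangle
\end{aligned} \qquad (5.30)$$

Hence for the dynamics of nearest neighbour pairs we obtain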



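$$\begin{aligned}
\frac{d}{dt} \langle II \rangle_1 &= \sum_{i=1}^{N} \sum_{j=1}^{N} J_{ij} \, \frac{d}{dt} \langle I_i I_j \rangle \\
&= -2a \sum_{i=1}^{N} \sum_{j=1}^{N} J_{ij} \langle I_i I_j \rangle \\
&\quad - 2b \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1}^{N} J_{ij} J_{jk} \langle I_i I_j I_k \rangle \\
&\quad + 2b \sum_{i=1}^{N} \sum_{k=1}^{N} \underbrace{\left( \sum_{j=1}^{N} J_{ij} J_{jk} \right)}_{=: (J^2)_{ik}} \langle I_i I_k \rangle
\end{aligned} \qquad (5.31)$$

Here $(J^2)_{ij}$ denotes the $ij$-th element of the squared matrix $J^2$.
This last term gives a contribution of the form $\langle II \rangle_2$, see equation
(5.17).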

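In total we obtain for the pair dynamics

$$\begin{aligned}
\frac{d}{dt} \langle II \rangle_1 &= 2b \big( \langle II \rangle_2 - \langle III \rangle_{1,1} \big) - 2a \langle II \rangle_1 \\
&= 2b \langle ISI \rangle_{1,1} - 2a \langle II \rangle_1
\end{aligned} \qquad (5.32)$$

with $\langle ISI \rangle_{1,1} := \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1}^{N} J_{ij} J_{jk} \langle I_i (1 - I_j) I_k \rangle$. Again the ODE
for the nearest neighbours pair $\langle II \rangle_1$ involves higher moment terms like $\langle II \rangle_2$
and $\langle III \rangle_{1,1}$.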

We now try to approximate the higher moments in terms of lower ones in order
to close the ODE system. The quality of the approximation will depend on
the actual parameters of the birth-death process, i.e. $a$ and $b$. We first
investigate the mean field approximation, expressing $\langle II \rangle_1$ in terms of $\langle I \rangle$. Then
other schemes to approximate higher moments are shown, like the BBGKY
approximation (after Bogolyubov, Born, Green, Kirkwood, Yvon).

19.5.4. Mean field behaviour 

In mean field approximation, in the interaction term the exact number of
inhabited neighbours is replaced by the average number of inhabitants in the
full system, acting like a mean field on the actually considered site. Hence we
set

$$\sum_{j=1}^{N} J_{kj} I_j \approx \sum_{j=1}^{N} J_{kj} \frac{\langle I \rangle}{N} = \frac{Q}{N} \langle I \rangle \qquad (5.33)$$



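and get for $\langle II \rangle_1$ in equation (5.35)

$$\begin{aligned}
\langle II \rangle_1 &= \sum_{i=1}^{N} \sum_{j=1}^{N} J_{ij} \langle I_i I_j \rangle = \Big\langle \sum_{i=1}^{N} I_i \sum_{j=1}^{N} J_{ij} I_j \Big\rangle \\
&\approx \Big\langle \sum_{i=1}^{N} I_i \, \frac{Q}{N} \langle I \rangle \Big\rangle = \frac{Q}{N} \langle I \rangle \cdot \Big\langle \sum_{i=1}^{N} I_i \Big\rangle = \frac{Q}{N} \langle I \rangle^2
\end{aligned} \qquad (5.34)$$

hence

$$\begin{aligned}
\frac{d}{dt} \langle I \rangle &= b \left( Q \langle I \rangle - \frac{Q}{N} \langle I \rangle^2 \right) - a \langle I \rangle \\
&= b \, \frac{Q}{N} \big( N - \langle I \rangle \big) \langle I \rangle - a \langle I \rangle .
\end{aligned} \qquad (5.35)$$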

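For homogeneous mixing, i.e. the number of neighbours equals the total
population size, $Q = N$, we obtain the logistic equation for the total number of
inhabited sites

$$\frac{d}{dt} \langle I \rangle = b \big( N - \langle I \rangle \big) \langle I \rangle - a \langle I \rangle \qquad (5.36)$$

or for the proportion $\frac{\langle I \rangle}{N} =: x \in [0, 1]$

$$\frac{d}{dt} \frac{\langle I \rangle}{N} = N b \left( 1 - \frac{\langle I \rangle}{N} \right) \frac{\langle I \rangle}{N} - a \frac{\langle I \rangle}{N} \qquad (5.37)$$

and hence

$$\frac{dx}{dt} = N b \, (1 - x) \cdot x - a \cdot x . \qquad (5.38)$$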
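
The long-time behaviour of the logistic equation (5.38) can be explored with a few lines of code. The following is a minimal sketch using a plain forward Euler step; the parameter values, the step size, and the function names `mean_field_rhs` and `integrate` are assumptions chosen for illustration only.

```python
def mean_field_rhs(x, N, b, a):
    """Right-hand side of dx/dt = N b (1 - x) x - a x."""
    return N * b * (1.0 - x) * x - a * x

def integrate(x0, N, b, a, dt=1e-3, steps=50_000):
    """Forward Euler integration of the mean field equation up to t = dt * steps."""
    x = x0
    for _ in range(steps):
        x += dt * mean_field_rhs(x, N, b, a)
    return x

# Supercritical case, N b > a: the proportion settles at x* = 1 - a / (N b).
N, b, a = 100, 0.02, 1.0
x_end = integrate(0.01, N, b, a)
x_star = 1.0 - a / (N * b)

# Subcritical case, N b < a: the population dies out.
x_extinct = integrate(0.5, 100, 0.005, 1.0)

print(x_end, x_star, x_extinct)
```

Setting the right-hand side of (5.38) to zero gives the two stationary states $x = 0$ and $x^* = 1 - a/(Nb)$, the latter being non-negative only for $Nb \geq a$, which is what the two runs illustrate.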

19.5.5. Pair approximation

For the simplest pair approximation scheme we obtain the closed ODE system

$$\frac{d}{dt} \langle I \rangle = b \big( Q \langle I \rangle - \langle II \rangle_1 \big) - a \langle I \rangle$$

(5.39) 

d ,TT\ ~ Oh 2 MI1 - MO OntTT\ 

-{II), « 26- Ar _ (/) 2a(JJ> 1 

where the triple appearing originally in the second ODE is approximated by
pairs and singles. For further details on approximation schemes and simulation
evaluations see e.g. [Rand, 1999] and references there.

19.6 Directed percolation and path integrals 

For a long time it has been numerically established that simple birth-death 
processes for mutually excluding particles on a lattice belong in criticality to 
the universality class of directed percolation [Grassberger & de la Torre, 1979]. 
But only recently, attempts have started to describe such hard-core particles in 
a field theory [Park et al, 2000] and even more recently in a formalism easily 
treated analytically to obtain such field theories, i.e. bosonic theories [Wijland,
2001]. Van Wijland uses $\delta$-functions built from Bose operators.

We show that the $\delta$-bosons used by [Wijland, 2001] can mimic the spin 1/2
operators used in [Grassberger & de la Torre, 1979] and derive a path integral
which can be compared to those analysed for directed percolation [Janssen, 
1981]. To make the link between such hard-core processes and directed perco- 
lation precise is especially important for modelling epidemics, which naturally 
happen in entities of uninfected or single infected individuals, e.g. in plant epi- 
demics plants on regular lattice points (see e.g. [Stollenwerk & Briggs, 2000]), 
or in animal and human epidemics on social network lattices (e.g. [Rand, 
1999]). 

19.6.1. Master equation of the birth-death-process 

One of the simplest and best studied spatial processes is the birth-death
process with birth rate $\beta$ and death rate $\alpha$ on $N$ sites, of which each can be either
inhabited, $I := 1$, or empty or solo, $S := 1$, hence $I = 0$ (in general $S := 1 - I$).
In this section, $\alpha$ and $\beta$ will stand for death respectively birth rate, since $a$ will
be used for annihilators, as is convention in particle and stochastic physics.

Translated into epidemiology, $I$ is the infected, $S$ the susceptible class, $\beta$ the
infection rate, $\alpha$ the recovery rate. We refer to it as the SIS-system. The master
equation for the spatial SIS-system for $N$ lattice points is, in a form as used for
example in [Glauber, 1963] for a spin dynamics,

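$$\frac{d}{dt} p(I_1, \dots, I_N, t) = \sum_{i=1}^{N} w_{I_i, 1-I_i}(t) \, p(I_1, \dots, 1 - I_i, \dots, I_N, t) - \sum_{i=1}^{N} w_{1-I_i, I_i}(t) \, p(I_1, \dots, I_i, \dots, I_N, t) \qquad (6.1)$$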

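for $I_i \in \{0, 1\}$ and transition rate

$$w_{I_i, 1-I_i} = \beta \left( \sum_{j=1}^{N} J_{ij} I_j \right) \cdot I_i + \alpha \cdot (1 - I_i) , \qquad (6.2)$$

and

$$w_{1-I_i, I_i} = \beta \left( \sum_{j=1}^{N} J_{ij} I_j \right) \cdot (1 - I_i) + \alpha \cdot I_i , \qquad (6.3)$$

with $\beta$ birth or infection rate and $\alpha$ death or recovery rate. Here $(J_{ij})$ is the
adjacency matrix containing 0 for no connection and 1 for a connection between
sites $i$ and $j$, hence $J_{ij} = J_{ji} \in \{0, 1\}$ for $i \neq j$ and $J_{ii} = 0$.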

The master equation can be transformed into a Schrödinger-like equation
using operators common in quantum theory ([Grassberger & Scheunert, 1980],
[Peliti, 1985]), from which a path integral can be derived for the
renormalization analysis.

19.6.2. Schrödinger-like equation

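The master equation (6.1) can be written in the following form of a linear
operator equation

$$\frac{d}{dt} |\Psi(t)\rangle = L \, |\Psi(t)\rangle \qquad (6.4)$$

for a Liouville operator $L$ to be calculated from the master equation (Equ.
(5.1)) and with state vector $|\Psi(t)\rangle$ defined by

$$\begin{aligned}
|\Psi(t)\rangle &:= \sum_{I_1=0}^{1} \cdots \sum_{I_N=0}^{1} p(I_1, \dots, I_N, t) \, (c_1^+)^{I_1} \cdots (c_N^+)^{I_N} |0\rangle \\
&= \sum_{\{I\}} p(I_1, \dots, I_N, t) \left( \prod_{i=1}^{N} (c_i^+)^{I_i} \right) |0\rangle
\end{aligned} \qquad (6.5)$$

and vacuum state $|0\rangle$. The creation and annihilation operators are defined by
$c_i^+ |0\rangle = |1\rangle$ and $c_i |1\rangle = |0\rangle$, and $(c_i^+)^2 |0\rangle = 0$ and $c_i |0\rangle = 0$, hence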

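$$c_i |I_i\rangle = I_i \cdot |1 - I_i\rangle \qquad (6.6)$$

$$c_i^+ |I_i\rangle = (1 - I_i) \cdot |1 - I_i\rangle \qquad (6.7)$$

and $(c_i^+)^2 |I_i\rangle = c_i^2 |I_i\rangle = 0$. We have anti-commutator rules on single lattice
sites

$$[c_i, c_i^+]_+ := c_i c_i^+ + c_i^+ c_i = 1 \qquad (6.8)$$

and ordinary commutators for different lattice sites $i \neq j$

$$[c_i, c_j^+]_- := c_i c_j^+ - c_j^+ c_i = 0 \qquad (6.9)$$

respectively

$$[c_i, c_j]_- = 0 , \qquad [c_i^+, c_j^+]_- = 0 . \qquad (6.10)$$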

These are exactly the raising and lowering operators in [Brunel et al, 2000]
with

$$c^+ = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} , \qquad c = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} , \qquad (6.11)$$

for vectors

$$|1\rangle = \begin{pmatrix} 1 \\ 0 \end{pmatrix} , \qquad |0\rangle = \begin{pmatrix} 0 \\ 1 \end{pmatrix} , \qquad (6.12)$$

respectively product spaces of it for many particle systems as considered here.
[Brunel et al, 2000] then use the Jordan-Wigner transformation to change to
pure Fermi operators with anti-commutation on single sites and on different
sites to obtain their path integrals. We use a different way.
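The dynamics is expressed by

$$\frac{d}{dt} |\Psi(t)\rangle = \sum_{\{I\}} \left( \frac{d}{dt} p(I_1, \dots, I_N, t) \right) \prod_{i=1}^{N} (c_i^+)^{I_i} |0\rangle = \dots = L \, |\Psi(t)\rangle \qquad (6.13)$$

where the master equation has to be used to obtain the specific form of the
operator $L$. The explicit calculations, here only denoted by $\dots$, will be shown
below.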

For the birth-death process (Equ. (5.1)) the Liouville operator is after some
calculation

$$L = \sum_{i=1}^{N} (1 - c_i) \, \beta \left( \sum_{j=1}^{N} J_{ij} c_j^+ c_j \right) c_i^+ + \sum_{i=1}^{N} (1 - c_i^+) \, \alpha \, c_i . \qquad (6.14)$$

The term $(1 - c_i)$ guarantees the normalization of the master equation solution
and $\beta J_{ij} c_j^+ c_j c_i^+$ creates one infected at site $i$ from a neighbour $j$ which is
itself not altered. $c_j^+ c_j$ is simply the number operator on site $j$. Furthermore, $\alpha c_i$
removes a particle from site $i$, again ensuring normalization with $(1 - c_i^+)$.

This form Equ. (6.4) and Equ. (6.14) is exactly the form given as well in 
([Grassberger & de la Torre, 1979], there pp. 392-394, Appendix A), hence 
using the raising and lowering operators. 
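
To make the structure of the Liouville operator (6.14) concrete, the sketch below builds it as an explicit matrix for N = 2 connected sites. The pure Python matrix helpers, the Kronecker ordering of the sites (site 1 as the leftmost factor), and the rate values are assumptions of this example, not part of the original derivation.

```python
def kron(A, B):
    """Kronecker product of two matrices given as nested lists."""
    return [[x * y for x in row_a for y in row_b] for row_a in A for row_b in B]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def scale(s, A):
    return [[s * x for x in row] for row in A]

ONE = [[1, 0], [0, 1]]
C_PLUS = [[0, 1], [0, 0]]     # creation operator in the basis |1> = (1,0), |0> = (0,1)
C_MINUS = [[0, 0], [1, 0]]    # annihilation operator

def site_op(op, i, n_sites):
    """Operator acting as `op` on site i and as the identity on all other sites."""
    result = [[1]]
    for k in range(n_sites):
        result = kron(result, op if k == i else ONE)
    return result

def liouville(J, alpha, beta):
    """Matrix of L = sum_i (1 - c_i) beta (sum_j J_ij n_j) c_i^+ + sum_i (1 - c_i^+) alpha c_i."""
    n = len(J)
    dim = 2 ** n
    unit = site_op(ONE, 0, n)
    L = [[0.0] * dim for _ in range(dim)]
    for i in range(n):
        c_i, cp_i = site_op(C_MINUS, i, n), site_op(C_PLUS, i, n)
        infected_neighbours = [[0.0] * dim for _ in range(dim)]
        for j in range(n):
            if J[i][j]:
                number_j = matmul(site_op(C_PLUS, j, n), site_op(C_MINUS, j, n))
                infected_neighbours = add(infected_neighbours, number_j)
        birth = matmul(matmul(add(unit, scale(-1, c_i)),
                              scale(beta, infected_neighbours)), cp_i)
        death = scale(alpha, matmul(add(unit, scale(-1, cp_i)), c_i))
        L = add(L, add(birth, death))
    return L

alpha, beta = 0.7, 1.3            # recovery and infection rate (arbitrary test values)
J = [[0, 1], [1, 0]]              # two mutually connected sites
L = liouville(J, alpha, beta)

# Basis order with this Kronecker convention: |1,1>, |1,0>, |0,1>, |0,0>.
column_sums = [sum(L[r][c] for r in range(4)) for c in range(4)]
print(column_sums)
```

All column sums vanish, which is the matrix form of the normalization guaranteed by the factors $(1 - c_i)$ and $(1 - c_i^+)$. The column belonging to $|1,0\rangle$ contains the rate $\beta$ into $|1,1\rangle$ and the rate $\alpha$ into $|0,0\rangle$, and the empty state $|0,0\rangle$ is absorbing.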

However, it does not seem an easy task to construct the path integral from
such a Liouville operator, since no coherent states are constructed for the
raising and lowering operators. Therefore, [Brunel et al, 2000] proceed from
these spin 1/2 operators to fermion operators, using Grassmann variables for
the coherent states, whereas [Cardy & Tauber, 1998] use Bose operators from the
start, for which coherent states are easily available (e.g. [Le Bellac, 1991],
[Zinn-Justin, 1989]), hoping that rarely more than one particle will appear at a
single site. But [Park et al, 2000] have emphasized once again the need for a
rigorous formulation in terms of hard-core particles for which the exclusion
principle on a single site is guaranteed and commutation on different sites as well.

This aim can be achieved by constructing $\delta$-operators for bosons [Wijland,
2001], as will be demonstrated for our birth-death process now.

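19.6.3. $\delta$-bosons for hard-core particles

Defining Bose operators $a^+$ and $a$ for states $|n\rangle$ with $n \in \mathbb{N}_0$ particles on
one site by

$$a^+ |n\rangle := |n + 1\rangle \qquad (6.15)$$

and

$$a |n\rangle := n \cdot |n - 1\rangle \qquad (6.16)$$

and the number operator $\hat{n} := a^+ a$ with

$$a^+ a |n\rangle = n |n\rangle \qquad (6.17)$$

we can use $\delta$-functions

$$\delta_{\hat{n}, k} |m\rangle = \delta_{m, k} |m\rangle \qquad (6.18)$$

with a suitable representation, e.g.

$$\delta_{\hat{n}, k} = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{i u (\hat{n} - k)} \, du \qquad (6.19)$$

[Wijland, 2001]. $\delta_{m, k}$ is the ordinary Kronecker delta whereas $\delta_{\hat{n}, k}$ is an
operator defined by Equation (6.18).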
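
That the integral representation (6.19) reproduces the Kronecker delta on number eigenstates can be checked numerically by replacing the operator $\hat{n}$ with its eigenvalue $n$. The sketch below does this with a midpoint rule; the function name `delta_integral` and the number of quadrature points are choices made only for this illustration.

```python
import cmath
import math

def delta_integral(n, k, points=64):
    """Midpoint-rule value of (1 / 2 pi) * integral over [-pi, pi] of exp(i u (n - k)) du."""
    h = 2 * math.pi / points
    total = sum(cmath.exp(1j * (-math.pi + (m + 0.5) * h) * (n - k))
                for m in range(points))
    return total * h / (2 * math.pi)

same = delta_integral(3, 3)        # n = k
different = delta_integral(3, 1)   # n != k

print(abs(same), abs(different))
```

For integer $n - k$ smaller in magnitude than the number of quadrature points, the equally spaced rule over one full period is exact up to rounding, so the result is 1 for $n = k$ and 0 otherwise.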

Then we obtain for the birth-death process the following Liouville operator

$$L = \sum_{i=1}^{N} (a_i^+ - 1) \, \beta \left( \sum_{j=1}^{N} J_{ij} \delta_{\hat{n}_j, 1} \right) \delta_{\hat{n}_i, 0} + \sum_{i=1}^{N} (a_i - 1) \, \alpha \, \delta_{\hat{n}_i, 1} \qquad (6.20)$$

which can be understood easily when replacing $\delta_{\hat{n}_i, 1}$ in the bosonic theory by
$c_i^+ c_i$ in the spin 1/2 theory, and $\delta_{\hat{n}_i, 0}$ by $(1 - c_i^+ c_i)$, and simply replacing $a_i$ by
$c_i$ and $a_i^+$ by $c_i^+$. Then evaluating the resulting Liouville operator in terms of
the spin 1/2 commutation rules results exactly in Equ. (6.14) again.

19.6.4. Path integral for hard-core particles in a 
birth-death process 

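The path integral follows from integrating (6.4)

$$|\Psi(t)\rangle = \prod_{\nu=1}^{M} \big( 1 + \Delta t \cdot L(t - \nu \cdot \Delta t) \big) \, |\Psi(t_M)\rangle$$

Hence with $\Delta t \to 0$, $M \to \infty$ and the finite time interval $\Delta t \cdot M = t - t_0$ we
obtain for any expectation value $\langle f \rangle$ defined as

$$\langle f \rangle(t) := \sum_{\{n\}} f(\{n\}) \, p(\{n\}, t) = \langle P | f | \Psi(t) \rangle$$

with a Felderhof projection state $\langle P| := \langle 0| e^{\sum_{i=1}^{N} a_i}$ [Felderhof, 1971] the path
integral

$$\langle P | f | \Psi(t) \rangle = \int \cdots \int \mathcal{D}\Phi^*(t) \, \mathcal{D}\Phi(t) \, f(t) \cdot e^{\int_{t_0}^{t} \mathcal{L}(\Phi^*, \Phi) \, dt'}$$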
with 

again in the limit $M \to \infty$ and $\Delta t \to 0$. The field variables $\Phi_i^*(t)$ and $\Phi_i(t)$
are introduced by coherent state integrals and replace the creation and
annihilation operators by complex scalar variables. Here the Lagrange function
is

$$\mathcal{L}(\Phi^*, \Phi) = \left( \beta \sum_{i} J_{ik} |\Phi_i|^2 e^{-|\Phi_i|^2} - \alpha \Phi_k \right) \cdot (\Phi_k^* - 1) \, e^{-|\Phi_k|^2} . \qquad (6.22)$$

This compares well with the path integrals used as a starting point for further
analysis of directed percolation [Janssen, 1981] when we only use the lowest
order of $\Phi$ in Taylor's expansion. Higher orders are expected to give irrelevant
renormalization fields.

The path integral is now ready for a further renormalization analysis (see 
[Cardy & Tauber, 1998]). On the numerical side the real space renormaliza- 
tion as initially described by [Ma, 1976] is promising for further progress in 
understanding the spatial birth-death process near criticality. In the following 
we give the derivations in more detail: 

19.6.5. Product space for spin 1/2 many particle systems

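With single particle creation and annihilation operators

$$c^+ := \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} , \qquad c := \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} \qquad (6.23)$$

and single particle state vectors

$$|1\rangle := \begin{pmatrix} 1 \\ 0 \end{pmatrix} , \qquad |0\rangle := \begin{pmatrix} 0 \\ 1 \end{pmatrix} \qquad (6.24)$$

the corresponding two-particle system would be constructed as a product space
with 4-dimensional state vectors and $4 \times 4$ matrices. Hence the vacuum state
is

$$|0\rangle := \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix} \qquad (6.25)$$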
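and a state containing one particle at site 1 and no particle at site 2, hence the
state $|1, 0\rangle$, is

$$|1, 0\rangle = \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \end{pmatrix} \qquad (6.26)$$

being created from $c_1^+ |0\rangle = c_1^+ |0, 0\rangle = |1, 0\rangle$. Hence the creation operators
for the two particles are the $4 \times 4$ matrices built from $2 \times 2$ matrices

$$c_2^+ = \begin{pmatrix} c^+ & 0 \\ 0 & c^+ \end{pmatrix} \qquad (6.27)$$

$$c_1^+ = \begin{pmatrix} 0 & \mathbb{1} \\ 0 & 0 \end{pmatrix} \qquad (6.28)$$

with $2 \times 2$ unit matrix $\mathbb{1}$, or written out e.g.

$$c_1^+ = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} .$$

The other operators $c_1$ and $c_2$ follow directly from this, and commutation rules,
e.g. $[c_1, c_2^+]_- = 0$, can be shown easily.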
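
The product space construction can be verified mechanically. The following minimal sketch assumes the Kronecker ordering in which site 1 is the left factor and uses small pure Python helpers (`kron`, `matmul`, `add`, `sub`) written only for this illustration.

```python
def kron(A, B):
    """Kronecker product of two matrices given as nested lists."""
    return [[x * y for x in row_a for y in row_b] for row_a in A for row_b in B]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

ONE = [[1, 0], [0, 1]]
C_PLUS = [[0, 1], [0, 0]]      # single particle creation operator
C_MINUS = [[0, 0], [1, 0]]     # single particle annihilation operator

# Two-particle operators: site 1 is the left factor (assumed convention).
c1_plus, c2_plus = kron(C_PLUS, ONE), kron(ONE, C_PLUS)
c1, c2 = kron(C_MINUS, ONE), kron(ONE, C_MINUS)

vacuum = [[0], [0], [0], [1]]                    # |0,0> as a column vector
state_10 = matmul(c1_plus, vacuum)               # c_1^+ |0,0>

anticommutator_same_site = add(matmul(C_MINUS, C_PLUS), matmul(C_PLUS, C_MINUS))
commutator_different_sites = sub(matmul(c1, c2_plus), matmul(c2_plus, c1))

print(state_10)
print(anticommutator_same_site)
print(commutator_different_sites)
```

With this ordering the state $|1, 0\rangle$ has its single entry in the second component, the single-site anti-commutator is the unit matrix, and the commutator between different sites vanishes.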

19.6.6. Path integral using coherent states for hard-core 
bosons 



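The Schrödinger-like equation

$$\frac{d}{dt} |\Psi(t)\rangle = L \, |\Psi(t)\rangle \qquad (6.29)$$

with the Liouville operator

$$L = \sum_{i=1}^{N} (a_i^+ - 1) \, \beta \left( \sum_{j=1}^{N} J_{ij} \delta_{\hat{n}_j, 1} \right) \delta_{\hat{n}_i, 0} + \sum_{i=1}^{N} (a_i - 1) \, \alpha \, \delta_{\hat{n}_i, 1} \qquad (6.30)$$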

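can be integrated formally using $\Delta t \to 0$ from the quotient of differences

$$\frac{d}{dt} |\Psi(t)\rangle \approx \frac{1}{\Delta t} \big( |\Psi(t)\rangle - |\Psi(t - \Delta t)\rangle \big) = L(t - \Delta t) \, |\Psi(t - \Delta t)\rangle \qquad (6.31)$$

showing

$$|\Psi(t)\rangle = \big( 1 + \Delta t \cdot L(t - \Delta t) \big) \, |\Psi(t - \Delta t)\rangle$$

and for several subsequent time steps

$$|\Psi(t)\rangle = \prod_{\nu=1}^{M} \big( 1 + \Delta t \cdot L(t - \nu \cdot \Delta t) \big) \, |\Psi(t - M \cdot \Delta t)\rangle$$

where $t - M \cdot \Delta t =: t_s$ is the starting time of the stochastic process.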
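
The time-sliced product can be illustrated on the smallest possible case. The sketch below assumes a single site with recovery only, written in the spin 1/2 representation, so that $L = \alpha (1 - c^+) c$ is a $2 \times 2$ matrix acting on the vector $(p_1, p_0)$; the rate, the final time, and the number of slices are arbitrary choices for this example.

```python
import math

alpha = 0.5                      # recovery rate (arbitrary test value)
# L = alpha * (c - c^+ c) in the basis |1> = (1,0), |0> = (0,1).
L = [[-alpha, 0.0],
     [alpha, 0.0]]

def step(psi, dt):
    """One time slice: psi -> (1 + dt * L) psi."""
    return [psi[r] + dt * sum(L[r][c] * psi[c] for c in range(2)) for r in range(2)]

t_total, M = 2.0, 20_000
dt = t_total / M

psi = [1.0, 0.0]                 # start with the site inhabited with probability 1
for _ in range(M):
    psi = step(psi, dt)

p_infected_exact = math.exp(-alpha * t_total)
print(psi[0], p_infected_exact, sum(psi))
```

The product of $M$ factors $(1 + \Delta t \cdot L)$ approaches the exact exponential decay $e^{-\alpha t}$ as the slices become finer, while the total probability stays normalized at every step.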
With the Felderhof projection operator [Felderhof, 1971]

$$\langle P| := \langle 0| e^{\sum_{i=1}^{N} a_i} \qquad (6.32)$$

and the definition for the state vector

$$|\Psi(t)\rangle = \sum_{I_1=0}^{1} \cdots \sum_{I_N=0}^{1} p(I_1, \dots, I_N, t) \left( \prod_{i=1}^{N} (a_i^+)^{I_i} \right) |0\rangle \qquad (6.33)$$

any measurable quantity $A$ as a function of the state variables $I_i$ in the master
equation formulation, respectively number operator $a_i^+ a_i$,

$$A := A(I_1, \dots, I_N) = A(\{I_i\}) = A(\{a_i^+ a_i\}) \qquad (6.34)$$

using the notation $\{I_i\} := \{I_1, \dots, I_N\}$, has for its expectation value

$$\langle A \rangle(t) := \sum_{\{I_i\}} A(\{I_i\}) \, p(\{I_i\}, t) \qquad (6.35)$$

the following expression

$$\langle A \rangle(t) = \langle P | A(\{a_i^+, a_i\}) | \Psi(t) \rangle . \qquad (6.36)$$

Again we use

$$\sum_{\{I_i\}} := \sum_{I_1=0}^{1} \cdots \sum_{I_N=0}^{1} . \qquad (6.37)$$

The path integral for an expectation value is then expressed by

$$\langle P | A | \Psi(t_f) \rangle = \langle P | A \prod_{\nu=1}^{M} (1 + \Delta t \cdot L_\nu) | \Psi(t_s) \rangle \qquad (6.38)$$

with final time $t_f$ and starting time $t_s$ and times $t_\nu$ such that $t_0 = t_f$ and
$t_M = t_s$, $L_\nu := L(t_\nu)$.

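With coherent states $|\Phi\rangle := e^{\Phi a^+} |0\rangle$ and its completeness relation

$$\mathbb{1} = \int d^2\Phi \; e^{-|\Phi|^2} \, |\Phi\rangle \langle\Phi| \qquad (6.39)$$

and abbreviation $\int d^2\Phi := \int\!\!\int \frac{d\Phi^* \, d\Phi}{2\pi i}$ we have for the system with operators
$a_i$, $a_i^+$ the completeness relation

$$\mathbb{1} = \int \left( \prod_{i=1}^{N} d^2\Phi_i(t_\nu) \right) e^{-\sum_{i=1}^{N} |\Phi_i(t_\nu)|^2} \, |\{\Phi_i(t_\nu)\}_{i=1}^{N}\rangle \langle\{\Phi_i(t_\nu)\}_{i=1}^{N}| \qquad (6.40)$$

with $|\{\Phi_i(t_\nu)\}_{i=1}^{N}\rangle := |\Phi_1(t_\nu), \dots, \Phi_N(t_\nu)\rangle$.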



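We now can introduce unit operators $\mathbb{1}$ in between every time slice of the
path integral and then insert the completeness relations for the coherent states

$$\begin{aligned}
\langle P | A | \Psi(t_f) \rangle &= \langle P | A \, \mathbb{1} \left( \prod_{\nu=1}^{M} (1 + \Delta t \cdot L_\nu) \, \mathbb{1} \right) | \Psi(t_s) \rangle \\
&= \int \left( \prod_{i=1}^{N} d^2\Phi_i(t_f) \right) \langle P | A | \{\Phi_i(t_f)\}_{i=1}^{N} \rangle \\
&\quad \cdot \int \left( \prod_{i=1}^{N} \prod_{\nu=1}^{M} d^2\Phi_i(t_\nu) \right) \left( \prod_{\nu=1}^{M} e^{-\sum_{i=1}^{N} |\Phi_i(t_{\nu-1})|^2} \langle \{\Phi_i(t_{\nu-1})\}_{i=1}^{N} | (1 + \Delta t \cdot L_\nu) | \{\Phi_i(t_\nu)\}_{i=1}^{N} \rangle \right) \\
&\quad \cdot e^{-\sum_{i=1}^{N} |\Phi_i(t_s)|^2} \langle \{\Phi_i(t_s)\}_{i=1}^{N} | \Psi(t_s) \rangle
\end{aligned} \qquad (6.41)$$

considering the non-boundary terms

$$\int \left( \prod_{i=1}^{N} \prod_{\nu=1}^{M} d^2\Phi_i(t_\nu) \right) \left( \prod_{\nu=1}^{M} e^{-\sum_{i=1}^{N} |\Phi_i(t_{\nu-1})|^2} \right) \qquad (6.42)$$

$$\langle \{\Phi_i(t_{\nu-1})\}_{i=1}^{N} | (1 + \Delta t \cdot L_\nu) | \{\Phi_i(t_\nu)\}_{i=1}^{N} \rangle \qquad (6.43)$$

further in the following.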
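It is

$$\langle \{\Phi_i(t_{\nu-1})\}_{i=1}^{N} | (1 + \Delta t \cdot L_\nu) | \{\Phi_i(t_\nu)\}_{i=1}^{N} \rangle = e^{\sum_{i=1}^{N} \Phi_i^*(t_{\nu-1}) \, \Phi_i(t_\nu)} + \Delta t \cdot \mathcal{L}_\nu$$

with

$$\langle \{\Phi_i(t_{\nu-1})\}_{i=1}^{N} | \{\Phi_i(t_\nu)\}_{i=1}^{N} \rangle = e^{\sum_{i=1}^{N} \Phi_i^*(t_{\nu-1}) \, \Phi_i(t_\nu)} \qquad (6.44)$$

and

$$\begin{aligned}
\mathcal{L}_\nu &:= \langle \{\Phi_i(t_{\nu-1})\}_{i=1}^{N} | L_\nu | \{\Phi_i(t_\nu)\}_{i=1}^{N} \rangle \\
&= -\alpha \sum_{k=1}^{N} \Phi_k(t_\nu) \cdot \big( \Phi_k^*(t_{\nu-1}) - 1 \big) \, e^{\sum_{i=1, i \neq k}^{N} \Phi_i^*(t_{\nu-1}) \, \Phi_i(t_\nu)} \\
&\quad + \beta \sum_{k=1}^{N} \left( \sum_{j=1}^{N} J_{jk} \, \Phi_j^*(t_{\nu-1}) \Phi_j(t_\nu) \, e^{-\Phi_j^*(t_{\nu-1}) \Phi_j(t_\nu)} \right) \\
&\qquad \cdot \big( \Phi_k^*(t_{\nu-1}) - 1 \big) \, e^{\sum_{i=1, i \neq k}^{N} \Phi_i^*(t_{\nu-1}) \, \Phi_i(t_\nu)}
\end{aligned} \qquad (6.45)$$

with

$$\langle \{\Phi_i(t_{\nu-1})\}_{i=1}^{N} | \, a_k \, \delta_{\hat{n}_k, 1} \, | \{\Phi_i(t_\nu)\}_{i=1}^{N} \rangle = \Phi_k(t_\nu) \cdot e^{\sum_{i=1, i \neq k}^{N} \Phi_i^*(t_{\nu-1}) \, \Phi_i(t_\nu)} \qquad (6.46)$$

et cetera, using the coherent state definition

$$|\{\Phi_i(t_\nu)\}_{i=1}^{N}\rangle := e^{\sum_{i=1}^{N} \Phi_i(t_\nu) \, a_i^+} |0\rangle . \qquad (6.47)$$

In this way we obtain completely the path integral as given above.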

19.7 Summary 

We have described epidemic processes near criticality, and have given an
analysis for mean field models under homogeneous mixing conditions. In one case
we found that an epidemiological system evolves on its own towards criticality,
hence self-organizes towards the critical state. For spatial systems we
have presented the basic description of the master equation and have shown the
connection with the previous sections under the explicit analysis of mean field
assumptions. A complete analysis of the spatial system would reveal
qualitatively the same behaviour, in particular again power laws for the
distributions of epidemics, but with different exponents. The detailed analysis
via renormalization is still under debate.